EDRM Evergreen/Processing/Intake

Comments: Please submit comments to the EDRM Evergreen Processing forum

Pre-planning is an important component of any good project management methodology, and the same is true for the methodology of managing electronic discovery processing projects. Decisions made in the identification, preservation, and collection phases of the electronic discovery life cycle dictate some of the requirements and activities that will be undertaken during the processing phase. For example: the types of data preserved and collected, the amount of data contained in those collections, as well as the time frames for production may have been decided prior to the act of processing electronic data or engaging with a provider to process electronic data.

Other influences on the requirements for the processing phase include the needs of the review and production phases. Items such as the schedule for review and the technologies used to review the data will influence the way the data must be processed. Each of the decisions made in these areas will influence the costs of the electronic discovery processing project. By taking each one of these elements into consideration prior to undertaking the Processing phase, the consumer can make informed decisions in line with the legal and financial objectives of the project.

Determining Scope

Electronically Stored Information (ESI) is discoverable under a number of rules and authorities, including the Federal Rules of Civil Procedure, state rules of civil procedure, federal criminal rules, state criminal rules and case law addressing electronic documents. These rules and case law are an increasingly large factor in internal and governmental investigations. Amendments to the Federal Rules of Civil Procedure (FRCP), in effect since December 2006, require organizations that operate within the U.S. to manage their electronic data so it can be produced in a timely and complete manner.

Parties are able to request of each other “any designated electronically stored information – including writings, drawings, graphs, charts, photographs, sound recordings, images, and other data or data compilations in any medium…” Typically, in a meet and confer session, the legal team (outside and in-house counsel) determines the list of custodians, the types of data, the form or forms in which the data will be produced, and the timeframes that are required for the matter at hand. Each type of data available within the appropriate timeframe must be processed, reviewed and produced for each custodian. For example, for a particular matter, there may be 100 custodians who have data collected from their email (servers or local mailbox), hard drive, and shared network storage location over a six-month period.

Each piece of data must be accounted for in every stage of electronic discovery processing. What is commonly known about the electronic discovery project by the time of processing? Typically, we know the list of custodians that have data relevant to the matter and the time period in which the responsive information is likely to reside. By employing media analysis tools, we can analyze the type and contents of the collected media. And by understanding the goals and requirements for the review, we can determine the processing specifications. These are the factors that determine how much data will be processed and when. Corporations and their law firms can structure their reviews based on this analysis, and these pieces of the puzzle help consumers control and manage their electronic discovery projects.

The requirements for processing electronic discovery data can range from the simple to the extreme and from small datasets to large volumes. The schedules for completing the processing can also run from short (days) to long (months). Because these variables range so widely, determining the scope of the task in terms of the types and size of the data is critical to identifying the resources needed to complete the processing. Typically the fixed variable is the schedule. Large jobs that must be processed in short periods have resulted in costly fines and awards when the resources to perform the processing were not available. The processing environment, including whether it requires sequential or distributed parallel processing and the degree to which it can scale across one or more processing jobs, can also be critical in determining the scope of processing to be performed as well as the schedule. In short, dataset sizes and types, timeframes for processing, and the complexity of the processing requirements drive the scope of the project. This scope will ultimately drive the cost. (See Cost Drivers.)

Common States of ESI

On-line ESI

Information that is frequently accessed or is of high intrinsic value is generally kept on-line. Examples may include document templates, presentations, financial and customer information. This type of information may be stored on a local computer hard drive, or a private or shared network directory. Some of the characteristics of this type of information are:

  • The information is readily accessible
  • Depending on how disruptive collecting the data is to operations, this type of information can also be one of the least costly to collect
  • It is “volatile” – it can easily be altered or destroyed

Off-line ESI

Off-line ESI is information that, for a variety of reasons, including disaster recovery or information archival, is not kept in a readily accessible environment. The most common medium upon which this type of information is stored is magnetic tape. Some of the characteristics of this type of information are:

  • Less readily accessible
  • Generally more costly to access
  • Less volatile in that it is not easily destroyed without a deliberate act.

Near-line ESI

Near-line ESI is information that is accessed regularly but has lower business utility or value than on-line ESI. Examples may include recent email archives or ESI subject to a document retention policy that specifies that information beyond a certain date should be kept on inexpensive secondary storage devices.

Common Forms of ESI

Structured ESI

The most common form of this data is uniformly fielded information stored in database tables in a tabular format. Practical examples often encountered in electronic discovery include accounting and transactional databases, which may contain hundreds (or more) of uniformly structured tables.

  • Each row in a database table is a record.
  • Each record in a table possesses the same uniform field structure.

Unstructured ESI

Unstructured ESI is data stored on a “file system.” The folder structure on your home or office computer’s hard drive and the files in it comprise an unstructured data collection. Your network share drives and file folders typically store “unstructured information.” Unstructured ESI includes word processing documents, spreadsheets, presentations, audio and video files, text files and more.

Semi-structured ESI

Email is the most common form of semi-structured data that is encountered in discovery. It is semi-structured because a part of the data record, i.e. the email itself, is structured and the attachments, if they exist, are unstructured.

Dataset Sizes and Types

Establishing the custodian list is another important step in defining the scope of an electronic discovery project. The number of custodians can (and typically will) change several times during the engagement. For purposes of authentication and chain of custody, mapping users to the data source is critical. Since each custodian has the potential to generate many gigabytes (or more) of responsive data, the number of custodians is a key variable in determining the overall magnitude of any electronic discovery task. Coordinating with the legal team, the collection team, the processing team and the review platform team is critical to success. It is often appropriate to have a single point of contact for coordinating the custodian list in order to ensure consistency in naming conventions and in updates to the list.

Media Analysis Tools

Electronic discovery processing vendors and software packages have tools designed to analyze the contents of various pieces of media and provide clients with reports regarding the number, size, and data types contained on that media. Understanding the data collection at a more granular level allows corporations and their law firms to have more complete information on which to base decisions, as well as provide more accurate time frames for processing.

For example, it is intuitive that the larger a collection is in size, the longer it will take to process. However, the blend of the file types contained within that collection will also drive processing requirements. If you have two collections each containing 10 GB of data, and one of those collections contains 50% spreadsheet files and the other contains 5% spreadsheet files, the collection with the larger percentage of spreadsheet files will take longer to process and will probably result in more pages for attorneys to review.
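The kind of report such a tool produces can be approximated with a short script. The following is a minimal sketch, not any vendor's implementation: it classifies files by extension only, and the function names `summarize_media` and `share_by_bytes` are illustrative assumptions.

```python
import os
from collections import defaultdict


def summarize_media(root):
    """Walk a directory tree and tally file count and total bytes per extension."""
    summary = defaultdict(lambda: {"count": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            summary[ext]["count"] += 1
            summary[ext]["bytes"] += os.path.getsize(os.path.join(dirpath, name))
    return dict(summary)


def share_by_bytes(summary, ext):
    """Fraction of the collection's total size held by one extension."""
    total = sum(entry["bytes"] for entry in summary.values())
    if total == 0:
        return 0.0
    return summary.get(ext, {"bytes": 0})["bytes"] / total
```

Comparing `share_by_bytes(summary, ".xls")` across two collections of equal size gives a rough view of the spreadsheet mix discussed above.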

The scalability of the processing environment, including the need for and ability of the applied technology to segregate redundant files and to handle embedded or nested data types, must be clearly understood and closely matched with the delivery requirements for the matter. The selection of media analysis tools varies on a case-by-case basis. Increasingly, native files are requested, while other matters require TIFFs. Requirements for processing complex data types, such as documents and spreadsheets embedded (or nested) within e-mails, can place extreme demands on the scope of the processing to be performed. Choosing the correct media analysis tools, based on the scope and requirements of the job, is critical to processing performance and the completeness of the discovery.

Processing Specifications

Processing specifications drive the scope and therefore the timeframe and cost of processing and reviewing electronic data. Agreement must be reached not only on what data is to be processed, but also on what the input and output format of that data will be. Chief among those specifications is the type of output that the review requires and what processing will be required to provide that output.

There are several options for output from files: the metadata associated with the file, the searchable text contained in the file, the native file itself, as well as a rendering or image of the file. Each of these components contains vital information about the file that is being reviewed, and depending on the review system and type of review the law firm or corporation is undertaking, any combination of output can be provided.

For example, in the case of an internal review by in-house counsel, a corporation can elect to review the metadata and text associated with a collection of files to determine if a more extensive review is warranted. In contrast, for a Hart-Scott-Rodino (HSR) Second Request, where review and production deadlines are mandated by regulation and under tight time frames, a review team may require that all necessary forms of responsive files be output in the first pass. Review needs and matter specifics guide the law firm’s and corporation’s choices for output.

Since processing specifications can aid reviewers by providing fielded information that can be used as a first pass at tagging and coding data, legal teams have additional points to consider with respect to review needs. For example, it is possible to determine during the processing phase which documents were sent to or received by a specific email address. If email was sent to or received by counsel, those documents could be provided to the review team already marked as potentially privileged. By considering these options, the review team can save time, and therefore cost, in their review. (See Review Node.)
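As a rough illustration of this kind of first-pass tagging, the sketch below checks the address headers of a raw email against a list of counsel addresses using Python's standard `email` package. The address in `COUNSEL_ADDRESSES` and the function name are hypothetical, and a match only suggests that a document deserves a privilege review; it does not decide the question.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical counsel address list; in practice this would come from the legal team.
COUNSEL_ADDRESSES = frozenset({"counsel@example.com"})


def is_potentially_privileged(raw_message, counsel=COUNSEL_ADDRESSES):
    """Return True if any sender or recipient address belongs to counsel."""
    msg = message_from_string(raw_message)
    header_values = []
    for field in ("From", "To", "Cc", "Bcc"):
        header_values.extend(msg.get_all(field, []))
    # getaddresses parses "Name <addr>" lists into (name, addr) pairs.
    addresses = {addr.lower() for _name, addr in getaddresses(header_values)}
    return bool(addresses & counsel)
```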

One of the benefits of processing data automatically is the ability to hone datasets to a more manageable and responsive set, based on a specific set of criteria. Data culling options are tools that shape the electronic discovery project significantly. The effectiveness of the culling strategies employed is one of the most significant contributors to overall project cost. Any data that can be culled from a collected data set, and therefore does not require attorney review, saves significant review costs. Many times the culling rates ultimately exceed 90%; in large data collections, incremental percentages of culling effectiveness can save millions of dollars in review costs.
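The arithmetic behind this point is simple enough to sketch. The functions below are illustrative only; the document counts and per-document review cost passed to them are placeholders a project team would supply, not industry benchmarks.

```python
def review_cost(collected_docs, cull_rate, cost_per_doc):
    """Cost of reviewing whatever survives culling.

    cull_rate is the fraction of the collection removed before review (0.0 to 1.0).
    """
    remaining_docs = collected_docs * (1.0 - cull_rate)
    return remaining_docs * cost_per_doc


def incremental_savings(collected_docs, old_rate, new_rate, cost_per_doc):
    """Review cost avoided by raising the cull rate from old_rate to new_rate."""
    return (review_cost(collected_docs, old_rate, cost_per_doc)
            - review_cost(collected_docs, new_rate, cost_per_doc))
```

Because the saving scales with the size of the collection, each additional percentage point of culling is worth more on larger matters.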

There are several options that are widely accepted and used to manage increasingly large collections of electronic information. The goal of data culling is to leverage technology in a legally defensible manner and reduce the collected material to a more manageable set of potentially relevant documents for review. No culling methodology will be considered reasonable if it operates to exclude relevant documents before review. Technologies have been developed to assist in this culling of data through suppression of duplicates, called “de-duplication,” through filtering of documents based on their attributes (metadata) such as date ranges and file types, and through searching for key words, phrases, and concepts. During this pre-planning stage, the choice of culling options to be employed will direct assumptions regarding how much of the data collection will be segregated prior to review, affecting subsequent processing and review costs. (See the Culling, Prioritizing and Triage and Searching sections for a more in-depth review of the options available.)
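De-duplication is often explained in terms of content hashing, and the sketch below takes that approach. It is an assumption for illustration, since the text above does not prescribe a method: exact duplicates are detected by SHA-256 digest of the file content, and each suppressed file is recorded against the first copy seen so the suppression can be documented.

```python
import hashlib


def deduplicate(files):
    """Suppress exact duplicates by content hash.

    files: iterable of (name, content_bytes) pairs.
    Returns (unique_names, suppressed), where suppressed is a list of
    (duplicate_name, name_of_first_copy) pairs.
    """
    first_seen = {}
    unique, suppressed = [], []
    for name, content in files:
        digest = hashlib.sha256(content).hexdigest()
        if digest in first_seen:
            suppressed.append((name, first_seen[digest]))
        else:
            first_seen[digest] = name
            unique.append(name)
    return unique, suppressed
```

Whether duplicates are suppressed within each custodian or across the whole collection is a scoping decision; this sketch treats whatever list it is given as a single pool.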

Once decisions are made regarding what will be processed, there are myriad processing options available for the resulting files in the dataset. Some of these options refer to how a document is presented. For example, an email message can be presented in text format or it can be presented in a format consistent with the look of the file in its native email application. Also, clients have an option to determine how time zone dates and times are represented based on where the data was collected. One method of handling time zones is to normalize the entire dataset to a standard time zone, such as GMT or the time zone where the review team or corporation is located. Technologies also exist in the market to customize the time zones for subsets of the data collection, such as by user, using a time zone the legal team feels represents the data most appropriately.
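Normalizing timestamps to a single zone can be sketched with Python's standard `datetime` module. This is a simplified illustration: it assumes the collection location is described by a fixed UTC offset in hours (a hypothetical input), so it does not account for daylight saving rules, which a production tool would need to handle.

```python
from datetime import datetime, timedelta, timezone


def normalize_to_utc(local_naive, utc_offset_hours):
    """Interpret a naive timestamp at a fixed UTC offset and convert it to UTC."""
    source_zone = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=source_zone).astimezone(timezone.utc)


def normalize_by_custodian(records, offsets_by_custodian):
    """Apply a per-custodian offset to (custodian, naive_timestamp) records."""
    return [
        (custodian, normalize_to_utc(stamp, offsets_by_custodian[custodian]))
        for custodian, stamp in records
    ]
```

The second function mirrors the per-user customization described above: each custodian's data is converted using the offset the legal team has chosen for that custodian.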

There is also a collection of processing options related to how documents can be presented in image form. Often files (particularly e-mail files) will contain attached or embedded files. These files must also be identified and processed while maintaining parent/child relationship hierarchies. A determination must also be made about the handling of files containing algorithmic data or formulae, such as Excel spreadsheets and Access databases. The intent of these options is to represent the file in a way that is meaningful to a review team, while ensuring that all of the data is represented.
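One way to picture the parent/child hierarchy is as a list of records in which each attachment points back to its parent email. The sketch below uses Python's standard `email` package; the record fields and the dotted identifier scheme (`DOC0001.1`) are illustrative assumptions, and only immediate attachments are handled, not attachments nested inside other attachments.

```python
from email import message_from_bytes, policy


def build_family(raw_bytes, parent_id):
    """Return one record for the email and one for each immediate attachment."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    family = [{
        "doc_id": parent_id,
        "parent_id": None,  # the email itself is the top of the family
        "name": str(msg["Subject"] or ""),
    }]
    for n, part in enumerate(msg.iter_attachments(), start=1):
        family.append({
            "doc_id": f"{parent_id}.{n}",
            "parent_id": parent_id,  # link each child back to its parent
            "name": part.get_filename() or "",
        })
    return family
```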

For example, the review of a spreadsheet in its native format can be much more meaningful to a reviewer than would be a converted format (e.g. TIFF or PDF) because of the accessibility of hidden formulas within the fields. Similar problems exist for processing animated files such as PowerPoint presentations. Care must also be used when presenting these to a review team. By providing the team with options in the scoping phase, review teams can make decisions that will affect the ease of document review and ensure they receive the data in the manner that they expect.

Examples of Processing Options by File Type:

  • All Files
      • Threshold files larger than specified size
      • Auto date function options
      • Header and Footer options
      • Provide in image or native format
  • Email Files
      • Text representation
      • Native representation
      • Display archive and path on image representation
  • Word Processing Files
      • Reveal comments
      • Show edits and tracked changes
      • Paper size and orientation options
  • Presentation Files
      • Print speaker notes or slides only
      • Objects auto fit to printed page
  • Spreadsheet Files
      • Eliminate blank pages
      • Print area options
      • Reveal comments
      • Format custom column titles

The goal of scoping the processing specifications is to inform the legal team of the options and to ensure that the dataset is treated in a manner consistent with the needs of the matter and the review. It is optimal to iron out all issues prior to processing the data. However, the scope of litigation can change quickly. As part of the change management process, it is critical that the processing specifications be meticulously documented throughout the electronic discovery process. If changes are made midstream, the date, the specific change and the person authorizing the change should be documented. Good change management will ensure that the data is handled in a legally defensible manner.
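A change record of this kind needs only the three items named above. The sketch below shows one possible shape; the class name, field names and JSON log format are assumptions for illustration rather than a prescribed standard.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class SpecChange:
    """One midstream change to the processing specifications."""
    changed_on: date
    description: str
    authorized_by: str


def to_log_line(change):
    """Serialize a change as one JSON line suitable for an append-only log."""
    record = asdict(change)
    record["changed_on"] = change.changed_on.isoformat()
    return json.dumps(record, sort_keys=True)
```

Writing each change as its own line keeps the history in the order the changes were made, which makes it easier to reconstruct later which specification applied to which batch of data.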

Estimating Scope and Delivery Schedule

It used to be difficult to estimate time and cost in the earliest phases of an electronic discovery project, but the advent of more recent automated indexing and classification tools has greatly enhanced the ability of in-house counsel to make a quick assessment. (See Identification, Collection and Analysis Nodes for more information.)

Cost estimates during the processing phase pose a challenge because specifics of the matter, such as the list of custodians, may change as the data is collected and more is known about the case. Expect the time and cost estimate to change as you progress in the electronic discovery process. However, by employing good planning techniques, understanding the data to be processed, and understanding the processing options and the impact of those choices on the delivery schedule, estimates can be made using industry benchmarks (or client-specific benchmarks where available).

Certainly as the scope changes, these estimates must be adjusted. Staying informed during the electronic processing project will keep everyone up to date as to the project’s status and change management implications. Litigation sometimes seems to move at the speed of light. Electronic discovery projects are often initiated under tight timeframes and intense scrutiny. These factors provide even more reason to gather as much information as you can at the beginning of the process. By collecting the information discussed above, corporations and their law firms can arm themselves with information about the data that can assist them in making cost effective and legally savvy strategic decisions.

By being proactive during the identification phase, corporations can ensure that they obtain all data relevant to the investigation (or any electronic discovery need, either internal or external) and eliminate irrelevant data. This will allow for better time and cost estimates for processing, while mitigating the risks of potential sanctions due to missed evidence, spoliation, or missed deadlines caused by processing more data than is necessary.

Proactive preparation for potential electronic discovery requirements prior to the initiation of specific electronic discovery requests is also possible. Some corporations have begun to index and classify ESI against business rules in order to process electronic data in advance of a specific event, for the purposes of information security and privacy, compliance or overall risk management.

Requesting and Negotiating

Requesting and negotiating are critical components of any electronic discovery project. This is reflected in the changes to the FRCP that mandate a meet and confer session early in any discovery action. This meeting dictates the amount and type of information that is subject to processing further along in the lifecycle of the litigation. Understanding this process and its downstream effects on processing timeframes and costs can make a consumer of electronic discovery processing services better prepared to make cost-effective and efficient decisions. Additionally, the provider of the electronic discovery processing must be able to adapt to the realities these decisions impose on the processing specifications.

As stated earlier in the discussion regarding the scoping of the project, it is important to ensure that when you are ready to begin processing the data, your methodology is consistent with the strategy of the corporation or its law firm. The strategy should be clearly stated and delivered in writing along with the media to be processed. This holds true regardless of whether you are using an electronic discovery service or product provider or handling these responsibilities in-house. When determining the type of electronic data to collect for processing, it is important to develop a strategy. The combination of keywords and various properties that each computer file possesses will define the collection scope in an electronic discovery investigation. A single method or a combination of methods is often used to create the appropriate data preservation and collection strategy. A good strategy is documented and shared with all providers of services throughout the lifecycle of the electronic discovery project.

Depending upon the tools used, filtering criteria can be applied up front (see Collection Node) or at the back end, after the data is collected, according to the data culling techniques described herein. When and how much to filter will depend on several factors, including whether a re-collection may be necessary and what can be agreed on before the collection begins. Filtering up front has many advantages, including reducing the total data set to process, avoiding privileged documents and avoiding the collection of irrelevant and non-responsive documents.

Questions to ask when determining filtering criteria:

  • Can the search be limited to exclude selected folders/directories?
  • Do you need to review temporary internet files?
  • Can the search be limited by metadata attributes?
  • Can the search be limited by file extensions/types?
  • Can the search be limited by dates and/or times?
  • Can the search be limited to active files?
  • Can the search be limited to specifically named document(s)?
  • Can the search be limited to keywords?
  • Are you concerned that the custodian may be hiding data by renaming file extensions?
  • Are you looking for a specific document? If so, do you have an electronic copy of the document that can be used to find any and all exact matches?
  • Are you looking for all files created by a certain person (user created files)?
  • Are multimedia files of any relevance to this investigation (audio, graphic, video)?
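Several of the questions above translate directly into filter parameters. The sketch below expresses three of them (excluded folders, file extensions, and a date range) as a single criteria object; the class, field names and example values are hypothetical, and paths are treated as POSIX-style strings for simplicity.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FilterCriteria:
    excluded_dirs: frozenset  # folder names to skip, e.g. temporary internet files
    allowed_exts: frozenset   # lower-case extensions including the dot
    start: datetime           # inclusive lower bound on last-modified time
    end: datetime             # inclusive upper bound on last-modified time


def in_scope(path, modified, criteria):
    """Return True if a file passes every filter in the criteria."""
    p = PurePosixPath(path)
    if any(part in criteria.excluded_dirs for part in p.parent.parts):
        return False
    if p.suffix.lower() not in criteria.allowed_exts:
        return False
    return criteria.start <= modified <= criteria.end
```

Keeping the criteria in one object also makes them easy to record alongside the processing specifications, so the agreed filters can be documented and reproduced.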

Evaluating What Has Been Received

By the processing stage in the lifecycle of the electronic discovery project, the custodian list and data to be processed for review have usually been determined. However, once the actual review of the documents begins, the legal team begins to get a real sense of the issues and topics revealed within these documents. Based on what the review team sees, they may deem it appropriate to request additional documents from the existing custodian list or may add custodians to the list. Data sampling is an excellent way to determine the quality and accuracy of the information collected and how best to process the evidence. Data sampling may be used at both the collection and processing stages to ensure a defensible collection and culling strategy. The key is to have a set of procedures that you follow, without fail, when evaluating what you have received to be processed. The procedures need to be in writing and the steps need to be documented. Consistency in the treatment of the data and the process is important to a defensible strategy.
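A sampling step can be made repeatable, and therefore easier to document, by recording the random seed used to draw the sample. The sketch below is a minimal illustration using simple random sampling; the function name and the idea of logging the seed are assumptions, and the appropriate sample size is a question for the legal team and its advisers.

```python
import random


def draw_sample(doc_ids, sample_size, seed):
    """Draw a reproducible simple random sample of document identifiers."""
    rng = random.Random(seed)
    # Sort first so the result depends only on the set of IDs and the seed,
    # not on the order in which the IDs happened to be supplied.
    population = sorted(doc_ids)
    return rng.sample(population, min(sample_size, len(population)))
```

Running the function again with the same identifiers and seed returns the same sample, which supports the consistency of process described above.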

Clear communication and collaboration between the Processing team and the Identification, Preservation and Collection teams/nodes are crucial to understanding what needs to be evaluated. The changing parameters of custodian lists and priorities based on evaluations of reviewed material can have a significant downstream effect on the processing phase. The timeframes at this point in the litigation cycle usually cannot be extended to account for these changes. (For more information, see Analysis Node.)

Culling, Prioritizing and Triage

Culling and searching occur throughout the discovery process and are tightly related components of any solid and defensible processing strategy. Simply defined, culling is the process of programmatically removing content that is irrelevant while searching is the process of identifying content that is most likely relevant and will require review. Together, when implemented effectively, culling and searching will reduce and focus the reviewable content universe—saving clients time and money for higher value downstream activities. More than 90% of collected electronic content can be non-responsive.

When developing a culling and searching strategy, the objective should always be to identify the most relevant content first and move it downstream to the review team. To accomplish this, the discovery team needs to develop a general understanding of the collection. This understanding includes answers to questions such as: How much data will we be receiving? How many custodians? What is the average amount per custodian? What types of files (e.g. emails, word processing documents, spreadsheets, etc.) will we be receiving? How will the data be delivered (e.g. on hard drives, DVDs, etc.)? There are myriad approaches and technologies for culling and searching. The key is to find the tools that best work for your unique requirements in a particular matter. Prior to selecting a particular culling and/or search technique or tool, it is important to understand what objectives the review team is trying to accomplish.

Understanding Background to Set Priorities

The amount of data being created is almost immeasurable and increasing all the time, which is why a triage and prioritization strategy is necessary to move ahead in a timely and efficient manner. As discussed in the Collection section, proper analysis of the key issues and collections can save significant time and money later in the discovery process.

The collection of information was originally based on such issues as:

  • Key custodians or issues
  • Preparation for key legal dates (depositions, hearings, filings)
  • Ease of collection

In the triage and prioritization stage we can employ those basic factors and then become more granular to address the more technical issues that result from more detailed management of electronic data.

Planning

The volumes and types of data, even after collection and culling, can be quite overwhelming. As with any other large project, it is important to follow a methodology that breaks down the work ahead. The following process helps set priorities and make full use of processing throughput:

  1. Understand key legal issues to identify and process the most critical data as the top priority. Plan on processing data in line with key legal deadlines.
  2. Identify potential issues and address potential changes with the court or the other side.
  3. Prioritize by key positions, departments and then by specific department members.
  4. Analyze data types and accessibility (e.g. backup or legacy types), along with review team factors such as availability of experts, legal subject expertise, financial analysis, and foreign language.
  5. Even with excellent culling techniques, the data needs further evaluation to understand it better, to possibly eliminate more of it, to identify future challenges, and to decide which data to process. A critical analysis of the file types in the collection can help make sense of even the largest collections. Understanding the file types gives insight into the applications the users relied on, as well as the likely speed and challenges of the review process.

File Analysis

File analysis is a process in which an application is used to give statistics on the types of files in the collection. This application can be used as an in-house tool, by a forensics expert, or by an electronic discovery vendor who may be processing the data. Some tools rely on the file extension, but more sophisticated tools analyze the file’s header information to determine the file type regardless of the file’s extension. It is possible to rename a file’s extension, as in document.old, to try to hide critical information. Tools that identify file type by file header can be useful if renamed files are suspected.

File types can help determine which files will be processed for review and which will not. Some file types are user generated in the normal course of business, while others are non-custodian-created files, such as those that come with a computer’s operating system. In most cases the custodian-created files are the ones that move on to be processed and reviewed. It is important to keep as much of a custodian’s data together as possible so that it can be managed as a whole through processing and review.

At this point the data falls into three groups: files that can be processed, files that will not be processed, and files with special handling needs. Special handling needs include files that require non-standard applications to process or view. Companies may develop their own internal applications or use specialized applications such as accounting or computer-aided design (CAD) programs, either of which can be critical in a matter. Certain files or certain custodians might require further forensic analysis depending on the nature of the matter, which would entail a forensics professional working with the collected data or going back to the original media to avoid spoliation.
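Header-based identification works by comparing the first bytes of a file against known signatures. The sketch below checks a handful of well-known signatures (PDF, ZIP-based containers, the OLE2 compound format used by legacy Office files, and PNG). The signature table is deliberately tiny, and the mapping of expected extensions is an illustrative assumption, not a complete catalogue.

```python
import os

# (leading bytes, label) pairs for a few well-known file signatures.
SIGNATURES = [
    (b"%PDF", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # also .docx, .xlsx, .pptx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-compound"),  # legacy .doc, .xls, .ppt
    (b"\x89PNG\r\n\x1a\n", "png"),
]

# Extensions normally seen with each label (illustrative, not exhaustive).
EXPECTED_EXTS = {
    "pdf": {".pdf"},
    "zip-container": {".zip", ".docx", ".xlsx", ".pptx"},
    "ole2-compound": {".doc", ".xls", ".ppt", ".msg"},
    "png": {".png"},
}


def sniff_type(header):
    """Return a label for the file's leading bytes, or None if unrecognized."""
    for signature, label in SIGNATURES:
        if header.startswith(signature):
            return label
    return None


def looks_renamed(filename, header):
    """Flag files whose extension does not match their recognized signature."""
    label = sniff_type(header)
    if label is None:
        return False  # unknown signature: no basis for a mismatch
    ext = os.path.splitext(filename)[1].lower()
    return ext not in EXPECTED_EXTS[label]
```

With this sketch, a PDF saved as document.old would be flagged, because its header still begins with the PDF signature even though the extension has changed.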

Review Team Factors

Knowing in advance that certain file types will prove difficult can help avoid delaying other information in the process. Files that contain complex data, such as relational databases, can be routed out of the standard process so that they can be analyzed for truly responsive data and formatted appropriately for production. Prioritization should also take into consideration the people who will be reviewing the information, including factors such as legal expertise, domain knowledge (e.g. scientists or accountants), and foreign languages.

[updated Jan. 29, 2008]
