Validation of results is an important phase of search. Some of the overall goals of this phase are the following.
- Determine in a cost-effective way whether the set of searches performed satisfies a production request. When searches are very broad, a large-scale human review is often impractical, so the validation phase should provide the necessary evaluation without consuming too many review resources.
- Produce enough results, quickly enough, to assess whether the initial set of searches needs to be modified (i.e., assist in the feedback loop to the Execute phase of search).
- Allow for comparison of alternative search methodologies.
- Support the needs of EDRM Metrics in terms of tracking and feeding processing and analysis metrics.
Measures for validating results depend on the overall goals of the e-discovery production. In particular, the following considerations apply.
- When searching for responsiveness review, validation of results may need to consider whether the overall goal is to be restrictive or over-inclusive, and the case situation may drive this choice. While certain documents would have a clear-cut responsive determination, some would not. An over-inclusive strategy would include documents that are not clear-cut responsive; a restrictive strategy would discard them. In some cases, the cost of human review and the overall budget may affect this choice.
- When searching for privilege review, a strict validation step may be required, producing feedback that broadens the searches so that they select more results. However, this may be reconsidered when claw-back agreements, selective waivers, and Federal Rules of Evidence (FRE) Rule 502 agreements are in place – i.e., the privilege search may be stricter.
- When evaluating whether it is necessary to expand the ESI collection (perhaps adding new custodians), validation of searches may need to evaluate the search results in the context of each collection batch. Quite often, the tiered collection of ESI results in some batches producing a larger number of responsive document hits than other batches. Thus, it is necessary to properly document and compare search results per batch.
9.1. Search Results Descriptions
To validate search results, the following need to be captured:
- The search query that was submitted.
- The number of documents that were found.
- The number of documents that were found to be duplicates of other documents.
- Whether a document is contained in another document (e.g., emails and their attachments).
- The number of documents that were searched.
- An identification of each document (using an MD5 or SHA-1 hash of its content).
- The number of hits within each document.
The above results may need to be broken down along additional dimensions, such as the following.
- Custodians where the results were found.
- Loading batches where the results were found, so that search results are tracked per-batch.
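The items listed above could be captured in a record along the following lines. This is only a minimal sketch: the names `content_hash`, `DocumentHit`, and `SearchResultRecord`, and all of their fields, are hypothetical illustrations and not part of any EDRM schema.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def content_hash(content: bytes, algorithm: str = "sha1") -> str:
    """Identify a document by a hash of its content ("md5" or "sha1")."""
    return hashlib.new(algorithm, content).hexdigest()


@dataclass
class DocumentHit:
    doc_hash: str                       # identification of the document
    hit_count: int                      # number of hits within the document
    custodian: str                      # custodian where the result was found
    batch: str                          # loading batch of the document
    parent_hash: Optional[str] = None   # containing document, e.g. the parent email
    duplicate_of: Optional[str] = None  # hash of the document this one duplicates


@dataclass
class SearchResultRecord:
    query: str                          # the search query that was submitted
    documents_searched: int             # number of documents that were searched
    hits: List[DocumentHit] = field(default_factory=list)

    @property
    def documents_found(self) -> int:
        return len(self.hits)

    @property
    def duplicates_found(self) -> int:
        return sum(1 for hit in self.hits if hit.duplicate_of is not None)

    def found_per_batch(self) -> Dict[str, int]:
        """Count documents found in each loading batch."""
        return dict(Counter(hit.batch for hit in self.hits))

    def found_per_custodian(self) -> Dict[str, int]:
        """Count documents found for each custodian."""
        return dict(Counter(hit.custodian for hit in self.hits))
```

Keeping the batch and custodian on each hit, rather than on the search as a whole, is what allows the same search to be compared across collection batches.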
9.2. Validation Methodologies
Several validation methodologies should be used throughout the development of the search criteria that select documents for attorney review. Many of the methodologies below involve the case team reviewing samples of documents and classifying them as Responsive or Not Responsive to the issues of the case, thereby increasing the precision of the search results.
9.2.1. Frequency Analysis
Initially, frequency analysis may be used to evaluate the effectiveness of the initial search criteria. The search terms are tested to determine whether they effectively discriminate between potentially relevant and clearly non-relevant data. Think of this as a reality check on the search results versus the overall collection size and the reasonably expected proportion of relevant results. If the collection consists of ESI manually designated as relevant by custodians, an 80% response rate might be reasonable. If, on the other hand, you are searching across the combined departmental mailboxes and file shares, you would expect a much smaller result set.
This method involves reviewing the proportionate counts of items returned by the initial search criteria set and by individual search terms. Depending upon your search technology, it is useful to be able to evaluate the overlapping search terms and see which items received only one hit and which received multiple hits.
Once you have identified overly broad terms, samples are used to develop valid qualifiers or exclusion terms that may be used in combination to focus or narrow the search. Non-discriminating criteria (terms that are over-inclusive and not likely to yield responsive documents) may be removed.
This analysis also identifies search criteria that fail to retrieve any data, or fail to retrieve the quantities or types of data expected, so that the team can determine whether these criteria need broadening or further investigation.
This process is used iteratively throughout the life cycle of a project as search criteria are modified. The goal of this method is to increase the relative precision, or proportion of relevant items, within the search results. It does not address the recall or completeness of relevant items out of the collection.
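The counts described above can be sketched as follows. This is an illustrative outline only: the function name `frequency_report` and the input shape (a mapping from each search term to the set of document IDs it hit) are assumptions, since the reporting available depends on the search technology in use.

```python
from collections import Counter


def frequency_report(term_hits, collection_size):
    """Summarize hit proportions for a search criteria set.

    term_hits maps each search term to the set of document IDs it hit
    (an assumed input shape for this sketch).
    """
    # Proportion of the collection returned by each individual term.
    per_term_rate = {
        term: len(docs) / collection_size for term, docs in term_hits.items()
    }
    # Number of distinct terms that hit each document.
    terms_per_doc = Counter(doc for docs in term_hits.values() for doc in docs)
    return {
        "per_term_rate": per_term_rate,
        # Proportion of the collection hit by at least one term.
        "overall_rate": len(terms_per_doc) / collection_size,
        "single_term_docs": sum(1 for n in terms_per_doc.values() if n == 1),
        "multi_term_docs": sum(1 for n in terms_per_doc.values() if n > 1),
        # Terms that retrieved nothing and may need broadening.
        "zero_hit_terms": [term for term, docs in term_hits.items() if not docs],
    }


report = frequency_report(
    {"merger": {1, 2, 3}, "contract": {3, 4}, "zzyzx": set()},
    collection_size=10,
)
print(report)
```

A term whose rate is close to the overall rate is a candidate for being overly broad, while a term in `zero_hit_terms` is a candidate for broadening or further investigation.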
9.2.2. Dropped Item Validation
As the search criteria set is being updated and modified during the initial investigation and analysis, dropped item analysis is another form of validation needed to ensure that Responsive items are not being inadvertently omitted through changes to the search criteria.
This comparison would sample documents that were originally results of one search criteria set but are no longer results of the modified search criteria set.
The case team would then review the samples of dropped items for responsiveness to ensure that Responsive items had not been dropped. If Responsive items are identified, they should be reviewed to determine whether additional terms need to be created to capture these items, or whether the modifications made to the criteria should be revised so that these items are still included.
The appropriate number of items to sample, whether chosen at random or systematically (example: every 20th item), is discussed in Section 9.5. Sampling checks should be repeated after the criteria have been modified until the team is satisfied that the threshold of confidence has been reached. Dropped item validation can be performed in combination with the following ‘Non Hit Validation’ sampling method, but items that fell within previously responsive search criteria should have a higher threshold of acceptable error.
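Selecting dropped items for review can be sketched as a set difference followed by sampling. The function names below are hypothetical, and document IDs are assumed to be sortable values such as hashes or integers.

```python
import random


def dropped_items(original_results, modified_results):
    """Items returned by the original criteria but not by the modified criteria."""
    return sorted(set(original_results) - set(modified_results))


def random_sample(items, sample_size, seed=None):
    """Simple random sample; returns every item if fewer than sample_size exist."""
    items = list(items)
    if sample_size >= len(items):
        return items
    return random.Random(seed).sample(items, sample_size)


def systematic_sample(items, step=20):
    """Systematic sample: every step-th item (the 20th, 40th, ... by default)."""
    return list(items)[step - 1::step]


original = range(100)          # results of the original criteria set
modified = range(50, 100)      # results after the criteria were changed
dropped = dropped_items(original, modified)
to_review = random_sample(dropped, 10, seed=1)
```

Recording the seed alongside the sample lets the same selection be regenerated later if the validation needs to be documented or repeated.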
9.2.3. Non Hit Validation
As search criteria are being developed and used to move resulting items forward into document review, samples should be taken of items that did not hit on any of the search criteria being used. The case team should review the non hit items for responsiveness. As with the Dropped Item Validation, any documents that are deemed Responsive during review will need to be evaluated to identify additional search criteria or modifications needed to existing criteria to capture these items to move them forward into review.
These methods of sampling the items outside of your results are critical to building a defensible process. They demonstrate reasonable efforts to ensure that everything responsive has been found. In contrast to Frequency Analysis, these methods improve search recall (the proportion of relevant items retrieved out of all relevant items in the collection). Using these methods decreases the risk of inadvertently missing relevant items, but will increase the total volume for review.
9.2.4. Review Feedback Validation
Documents in the review set are reviewed by attorneys for responsiveness, privilege, and other issues involved in the specific matter. Feedback from the review – i.e., calls by reviewers about which documents are relevant or privileged – can provide additional information useful in refining the search and selection criteria or in identifying gaps that require additional analysis. This feedback will be used for additional analysis and to refine the Search Criteria sets. The feedback may identify categories of documents that are not yielding responsive documents. This information will be used to develop exclusionary criteria that will identify documents to be excluded from the review set. Also, the feedback may identify new categories of documents that should be included and the criteria will be broadened to include those documents in the review set.
An example of Feedback Validation would be to do a complete review of a key custodian's collected ESI and then compare the relevant or privileged documents against the search criteria frequency metrics. If a particular search term appears in 80% of your relevant documents and in none of the non-relevant documents, that term belongs in the narrow set of known relevant criteria used to prioritize first pass review. Depending upon the reporting capabilities of the search system, it may be necessary to segregate and re-process the different categories of reviewed items to generate the feedback metrics.
This validation method can be used to improve both precision and recall of searches. It can also be applied to existing collections from previously reviewed matters to improve overall ESI categorization criteria prior to the initial searches.
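The comparison of reviewer calls against search term frequency can be sketched as below. The function name `term_feedback` and the input shapes are assumptions made for illustration; a real search system may expose these counts through its own reports.

```python
def term_feedback(term_hits, relevant_docs, non_relevant_docs):
    """For each term, report the share of reviewed documents it hit.

    term_hits maps a search term to the set of document IDs it hit;
    relevant_docs and non_relevant_docs are the reviewers' calls.
    """
    relevant = set(relevant_docs)
    non_relevant = set(non_relevant_docs)
    report = {}
    for term, docs in term_hits.items():
        docs = set(docs)
        report[term] = {
            # Share of relevant documents containing the term.
            "relevant_rate": len(docs & relevant) / len(relevant) if relevant else 0.0,
            # Share of non-relevant documents containing the term.
            "non_relevant_rate": (
                len(docs & non_relevant) / len(non_relevant) if non_relevant else 0.0
            ),
        }
    return report


feedback = term_feedback(
    {"alpha": {1, 2, 3, 4}, "meeting": set(range(1, 11))},
    relevant_docs={1, 2, 3, 4, 5},
    non_relevant_docs={6, 7, 8, 9, 10},
)
# "alpha" hits 80% of relevant and 0% of non-relevant documents,
# while "meeting" hits every document and does not discriminate.
```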
9.3. Effectiveness Benchmark for Searches
When comparing multiple search technologies, it is important to note that each search technology is likely to present different sets of results. Even within a single technology, multiple vendor offerings may produce different search results. Also, one has to evaluate the context of the search in the EDRM workflow to assess the effectiveness of searches.
- In the case of searching meta-data using fielded search (for Custodian Identification, etc.), there is a very high threshold of accuracy, and it is critical to understand variations in field properties, naming conventions and syntax.
- In early case analysis, search needs to be fast, but results must be ordered by relevance so that search results can be evaluated quickly and iterated.
- In the case of large-scale culling, searches must divide the population such that large populations fall into the culled/non-responsive bucket.
- In the case of searching for potentially responsive documents, the iterative validation should eliminate false negatives and minimize false positives.
- In the case of potentially privileged documents, the searches should be targeted against the responsive population and have a low threshold for false negatives.
A significant consideration is whether a document is responsive and how that determination is made. It is quite possible that a document is considered responsive because a subject-matter expert or an expert human reader familiar with the legal issues at hand has determined that the document is responsive. A responsive document need not contain a specific set of search terms, which makes the determination of responsiveness subjective. Consequently, the notion of how effective a search has been in identifying responsive documents is itself subjective. Furthermore, it is often impossible to obtain a human-review determination for the entire document collection, so a complete assessment of a search effort for every e-discovery undertaking is not feasible.
Similarly, the effectiveness of a search methodology for privilege or confidentiality review is also difficult to measure. While these reviews typically involve fewer documents (i.e., only the responsive ones), the cost of a review escape is very high in that it may cause waiver of privilege. Even a search with only a few false negatives is therefore likely to result in inadvertent production.
In the absence of a complete human-review determination, search effectiveness is often assessed using one of two methods.
- Determine the effectiveness of a particular search algorithm or methodology on a specific known collection.
- Perform sampling on the collection and judge the effectiveness against the sample.
9.4. Search Accuracy: Precision and Recall
Search accuracy is often measured using the information retrieval metrics Precision and Recall[1. Amit Singhal, John Choi, Donald Hindle, David Lewis, and Fernando Pereira, AT&T at TREC-7, Proceedings of the Seventh Text Retrieval Conference (TREC-7), pages 239-252. NIST Special Publication 500-242, July 1999.] [2. Performance Measures in Information Retrieval, Wikipedia article.]. While these measures are good at recording the effectiveness of a particular search, a complete e-discovery production may involve a combined set of searches, and a combined score is more relevant for discussion.
Precision measures the proportion of truly responsive documents in the retrieved set. Recall measures the proportion of all responsive documents in the corpus that were retrieved. These two ratios have been used extensively to characterize effectiveness of information retrieval systems.[3. Blair, D. C., and Maron, M. E. (1985). An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, Communications of the ACM, 28, 289-299.]
Precision = {Number of Responsive Documents Retrieved} / {Total Number Retrieved}
Recall = {Number of Responsive Documents Retrieved} / {Total Number Responsive}
A higher precision number implies that a large percentage of the retrieved documents are responsive and only a small number of non-responsive documents were categorized as responsive. A higher precision number is helpful in establishing an efficient second-pass human review. Also, if precision numbers are high, one can even contemplate eliminating a second pass review.
A higher recall number suggests that the information retrieval system was effective in retrieving a higher percentage of the responsive documents, and that fewer responsive documents are left in the unretrieved collection. A very low recall suggests that the automatic retrieval methods need improvement.
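As a worked illustration of the two ratios, the sketch below computes them from sets of document IDs. The function name `precision_recall` and the example figures are made up for illustration; in practice the full set of responsive documents is rarely known and must be estimated.

```python
def precision_recall(retrieved, responsive):
    """Return (precision, recall) for sets of retrieved and responsive document IDs."""
    retrieved = set(retrieved)
    responsive = set(responsive)
    responsive_retrieved = len(retrieved & responsive)
    precision = responsive_retrieved / len(retrieved) if retrieved else 0.0
    recall = responsive_retrieved / len(responsive) if responsive else 0.0
    return precision, recall


# 100 documents retrieved, 80 of them responsive, 200 responsive in the corpus.
retrieved = set(range(100))
responsive = set(range(20, 220))
precision, recall = precision_recall(retrieved, responsive)
print(precision, recall)  # 0.8 0.4
```

In this example the search is fairly precise (80% of what it returns is responsive) but misses 60% of the responsive documents.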
Another factor that must be considered is the total number of documents. It is often possible to achieve high Precision and high Recall rates in a small corpus, but as the corpus size increases, these rates drop considerably. In many automatic information retrieval scenarios, the Recall rates can be increased easily, but the Precision rates then drop. The point at which Precision and Recall intersect is a critical optimization data point, and information retrieval systems attempt to move this data point toward the top right corner of a Precision-Recall plot, where both rates are high.
Another useful measure is the F-Measure, which combines Precision and Recall into a single value.
F = 1/(α/P + (1- α)/R)
In the above formula, P is the Precision ratio, R is the Recall ratio, and α is a weight between 0 and 1 that gives different levels of importance to Precision vs. Recall. In most cases, equal weighting is chosen, so the α value is set to 0.5, which makes F the harmonic mean of Precision and Recall, 2PR/(P + R). This single measure is representative of search effectiveness.
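The formula above translates directly into code. In this sketch the function name `f_measure` is illustrative; note that α = 0.5 weights Precision and Recall equally, whereas α = 1 would reduce F to Precision alone and α = 0 to Recall alone.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted F-Measure: F = 1 / (alpha / P + (1 - alpha) / R)."""
    if precision <= 0 or recall <= 0:
        # The measure is taken as zero when either ratio is zero.
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


balanced = f_measure(0.8, 0.4)             # harmonic mean, 2PR / (P + R)
precision_only = f_measure(0.8, 0.4, 1.0)  # equals P
recall_only = f_measure(0.8, 0.4, 0.0)     # equals R
```

For P = 0.8 and R = 0.4, the balanced value is 0.64 / 1.2, or about 0.533, which sits closer to the weaker of the two ratios than a simple average would.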
9.4.1. Determining Precision
For any search methodology, determining Precision is the easier task, since both the number of truly responsive documents in the retrieved set and the number of retrieved documents can be determined easily. This is the case when the search selection criteria are specific enough to narrow the results to a small number. In our examples above, the selection criteria were able to narrow the total retrieved count to about 1% of the total size of the corpus. This set is then subjected to a human review to determine whether each document is truly responsive.
In situations where the selection criteria are not sufficient to reduce the retrieved set to a reviewable size, sampling is used to evaluate the precision.
9.4.2. Determining Recall
Recall is harder to determine, because it requires the total number of responsive documents in the full corpus, including those that were not retrieved. In general, effective automated culling methodologies leave a large percentage of documents classified as not responsive. A human review of this non-responsive collection is often cost prohibitive and would defeat the initial purpose of the automation. To overcome this problem, we use sampling to determine the proportion of responsive documents in the unretrieved set, and from it estimate the number of responsive documents that were not selected.
An alternative to sampling-based evaluation of Precision and Recall is to rank the documents by a scoring order and then select only a certain number of top-ranked documents. This is often referred to as relevance cutoff-based evaluation. Using a smaller set allows one to derive precision and recall, and the results are then extrapolated to the entire collection. We believe that this extrapolation is not as reliable as the sampling-based determination.
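The sampling-based estimate of Recall can be sketched as follows. The function name `estimate_recall` and the example figures are hypothetical, and the result is a point estimate only: it carries the sampling error discussed in Section 9.5.

```python
def estimate_recall(responsive_retrieved, unretrieved_count,
                    sample_size, responsive_in_sample):
    """Estimate Recall from a random sample of the unretrieved documents."""
    # Proportion of the reviewed sample that turned out to be responsive.
    missed_rate = responsive_in_sample / sample_size
    # Extrapolate that proportion to the whole unretrieved set.
    estimated_missed = missed_rate * unretrieved_count
    estimated_total_responsive = responsive_retrieved + estimated_missed
    if estimated_total_responsive == 0:
        return 0.0
    return responsive_retrieved / estimated_total_responsive


# Example: 8,000 responsive documents retrieved; 990,000 documents unretrieved;
# a sample of 1,500 unretrieved documents contains 3 responsive ones.
recall = estimate_recall(8000, 990000, 1500, 3)
print(round(recall, 3))
```

Here 3 out of 1,500 extrapolates to roughly 1,980 missed responsive documents, so the estimated Recall is about 8,000 / 9,980. Even a low rate in a very large unretrieved set can account for a meaningful number of missed documents.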
9.4.3. Iterative Searching
Quite often, the retrieval effectiveness of a search methodology may be low, and it may not be apparent that a significant number of expected documents have not actually been located. As an example, a landmark study by Blair and Maron[4. Blair, D. C., and Maron, M. E. (1985). An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, Communications of the ACM, 28, 289-299.] determined that keyword-based searches had a recall of only 20%. One way to improve retrieval effectiveness is to iterate multiple times with newer search queries. These queries can be designed to improve both precision and recall.
Methods for improving recall are:
- Supply additional search terms.
- Introduce new search technologies, such as concept search.
- Increase keyword coverage using wildcards and fuzzy matching.
- Use the initial search as a “training step” and incorporate a learning mechanism to locate other like-documents.
Methods for improving precision are:
- Supply more specific queries, involving more complex combinations of Boolean logic.
- Perform a pre-analysis of wildcards, misspellings, etc., and eliminate unnecessary wildcard expansions.
9.5. Sampling and Quality Control Methodology for Searches
One way to evaluate the effectiveness of a search strategy during the early stages of case analysis is to sample the results and evaluate the sample. Proper selection of samples is likely to yield a quick, cost-effective evaluation, which is critical when multiple evaluations are needed. Sampling is also helpful when performing the final evaluation of searches as part of a quality control step. Once again, the cost of evaluating a large collection of ESI using human review is an important consideration, and sampling reduces the number of documents that need to be examined.
The basic application of sampling requires a random selection of items from a larger population, evaluation of the presence of an attribute (such as Responsiveness) in the sample, and then estimation of the characteristics of the population. In doing this, one accepts a certain Error of Estimation (margin of error) and a Confidence Level that the estimated measure is within that error. As an example, sampling 1,537 items can provide an estimate within +/- 2.5% (an interval 5 percentage points wide) at a confidence level of 95%. One of the results from sampling theory is that, for large populations, the sample size needed to estimate an attribute with a given error and confidence level does not depend on the size of the population.[5. How to determine sample size: Six Sigma Initiative.]
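The sample size figure can be checked with the standard formula for estimating a proportion, n = z² p(1 - p) / e², where e is the margin of error, z is the normal quantile for the chosen confidence level, and p is the expected proportion. The sketch below assumes the worst case p = 0.5 and a large population; the function name `sample_size` is illustrative.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence_level=0.95, proportion=0.5):
    """Sample size for estimating a proportion in a large population."""
    # Two-sided normal quantile, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n = z * z * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)


print(sample_size(0.025))  # 1537: +/- 2.5% at 95% confidence
print(sample_size(0.05))   # 385:  +/- 5% at 95% confidence
```

Under these assumptions, 1,537 items correspond to a margin of +/- 2.5%, that is, an interval 5 percentage points wide, while a margin of +/- 5% needs only 385 items. The population size does not appear in the formula, which is why the required sample does not grow with the collection.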
An important aspect is exactly how one selects the samples. In general, a sampling effort takes into consideration broad knowledge of the population and devises an unbiased selection. In most cases, the party performing the sampling has some knowledge of the population, and it is the only party involved. In contrast, most litigation involves an adversarial relationship between a Requesting Party and a Producing Party, and since only one party has access to the underlying population of documents, agreeing on a sampling strategy is hard. An effective methodology is one that requires no knowledge of the data but can still apply the random selection process central to the effectiveness of sampling.
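One possible way to select items without any knowledge of the data is to order documents by a hash that neither party can steer and take the first n. The sketch below is an assumption offered for illustration, not a method prescribed by this guide: the function name `hash_ordered_sample` and the idea of a seed string agreed between the parties are both hypothetical.

```python
import hashlib


def hash_ordered_sample(doc_hashes, sample_size, shared_seed):
    """Pick sample_size documents by ordering them on a seeded hash.

    doc_hashes are content hashes (e.g. MD5 or SHA-1 hex strings);
    shared_seed is a string the parties agree on before sampling (assumed).
    """
    def rank(doc_hash):
        # Combine the agreed seed with the document hash and re-hash, so the
        # ordering depends on neither the document content nor the load order.
        return hashlib.sha256(f"{shared_seed}:{doc_hash}".encode("utf-8")).hexdigest()

    return sorted(set(doc_hashes), key=rank)[:sample_size]


docs = [hashlib.sha1(str(i).encode("utf-8")).hexdigest() for i in range(1000)]
sample = hash_ordered_sample(docs, 25, shared_seed="agreed-seed")
```

Because the selection is deterministic for a given seed, either party with access to the list of document hashes could reproduce it, and reordering the input does not change which documents are chosen.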
Additional details on various aspects of reliable sampling are described in Appendix 2.