Section 3 - Guidelines and Considerations

April 27, 2012

While statistical sampling can be very powerful, it is also important that it not be used incorrectly. Before going further into technical matters, this section provides guidance aimed at preventing problems.

3.1. Cull Prior to Sampling

Prior to evaluation and sampling, a population should be culled to remove files/documents that can safely be ruled out as being of interest. Consider, for example, an initial population of 1,000,000. Based on non-statistical techniques (such as interviews, dates, file type analysis, etc.), it is determined that 900,000 cannot possibly be of interest. The evaluation should be based on the remaining 100,000, which become the new universe for purposes of evaluation and sampling.

Aside from reflecting general common sense, this makes sense in terms of standard use of statistics. Based on the foregoing example, assume the inquiry is “responsiveness” and that it is believed a priori that none of the 900,000 are responsive. Using the basic normal approximation, it would be possible to sample 400 from the 1,000,000 population, knowing that the only possible responsive documents are from the 100,000 subset, and thus knowing that the sample proportion should be less than 10%.

Assume further that the observed proportion is 6%. Since the sample is from 1,000,000 documents, a 95% confidence level with a 5% margin of error would imply a confidence interval from 10,000 = (.06 – .05) * 1,000,000 to 110,000 = (.06 + .05) * 1,000,000. This is obviously a very wide and basically useless confidence interval. (Techniques that are more precise than the “classical” techniques would likely result in a similar conclusion.)

Therefore, from a statistics perspective, it makes more sense to run the sample of 400 on only the 100,000. Consistent with the foregoing, let’s say the sample proportion for this reduced population is 60% (the same 60,000 documents that made up 6% of the 1,000,000 now make up 60% of the 100,000). Since the sample is from 100,000 documents, a 95% confidence level with a 5% margin of error would imply a confidence interval from 55,000 = (.60 – .05) * 100,000 to 65,000 = (.60 + .05) * 100,000. This is obviously a much tighter and more useful confidence interval.
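The arithmetic in the two scenarios above can be reproduced with a short sketch. The function name `interval_in_documents` is hypothetical, and the sketch simply applies the fixed 5% margin of error quoted in the text rather than recomputing the margin from the sample.

```python
def interval_in_documents(population, sample_proportion, margin=0.05):
    """Scale (proportion - margin, proportion + margin) up to document counts."""
    low = (sample_proportion - margin) * population
    high = (sample_proportion + margin) * population
    return round(low), round(high)

# Scenario 1: sample 400 from the unculled 1,000,000, observe 6%.
unculled = interval_in_documents(1_000_000, 0.06)
# Scenario 2: sample 400 from the culled 100,000, observe 60%.
culled = interval_in_documents(100_000, 0.60)

print(unculled)  # (10000, 110000)
print(culled)    # (55000, 65000)
print(unculled[1] - unculled[0], culled[1] - culled[0])  # 100000 10000
```

With the same sample size and the same margin of error, the interval on the culled population is ten times narrower when expressed in documents, because the margin is multiplied by a population one tenth the size.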

Finally, there may be contexts in which it is felt that the entire initial 1,000,000 population (or the excluded 900,000) should be subjected to some sort of sampling. A good example would be as a check or review of the non-statistically based decision to exclude. This is really a different inquiry than the responsiveness estimate for the 100,000, and might be accomplished more efficiently using an acceptance testing approach, limited to the 900,000. The general point is that this analysis of the 900,000 can still be handled separately, and that it is not a good use of resources to combine this group with the 100,000 that are known to be of interest.

3.2. Be Sure that Family Relationships are Known Prior to Sampling

An area of general importance in e-discovery is the handling of families. Briefly, a family (of documents) is a collection of documents that are connected to each other. A common example is an email plus its attachments. A chain of emails would also be a family.

Aside from any considerations of statistical sampling, there are certain practices when handling families in the context of e-discovery.

The evaluation of a document that is part of a family (for example, as “responsive”) should not be done in isolation. Instead, the reviewer should examine the entire family. It is reasonably possible that a document that is an attachment to an email would not be responsive on its face, but would be responsive based on how it is referred to in the main body of the email.
Many feel all the documents in a family should be coded the same way. For example, if any part of the family is responsive, the entire family is responsive and all component documents are responsive. This is consistent with the view that the entire family constitutes a single communication. Others are less insistent on this, and there may be circumstances where a different approach is warranted. (For example, the only existing version of a non-privileged document is as an attachment to a privileged email.) The handling could also depend on the particular agreements and expectations of the producing parties.
(An older posting that discussed the handling of families was linked here; the webpage is no longer available.)

While the management of families, itself, is not an issue of statistics, it has implications when there is a desire to apply sampling techniques.

If it is an accepted general protocol that a document should be evaluated as part of a family, this protocol should also be followed when sampling documents to estimate document responsiveness. More specifically, if a document has been selected as a sample observation, and that document is part of a multi-document family, the reviewer should also examine non-sampled family members in order to reach a decision on the sampled document.
Thus, it is good practice to be sure that family relationships are known before commencing the sampling.
Note – While the non-sampled family members do not become part of the sample, they must be considered when evaluating the sampled family member because they can provide additional context.

If the sampling looks only at documents without taking into account family relationships, it will not conform to the basic requirement that each sample observation should be evaluated in the same way that it would have been if the entire population were being observed. This could significantly distort the results.

Finally, in the area of documents vs. families, it is worth pointing out that it is possible to analyze an attribute at both the family level and the document level. The question of, “What percentage of documents is responsive?” is distinct from the question of, “What percentage of families is responsive?” If answering the first question, you would randomly sample documents. If answering the second question, you would randomly sample families. Depending on the need and circumstances, either question might make sense, but be aware of the need to handle them differently.
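The distinction between sampling documents and sampling families, and the practice of pulling in non-sampled family members for context, can be sketched as follows. The toy `documents` list, the field names `doc_id` and `family_id`, and all three function names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy population: each document records the family it belongs to.
documents = [
    {"doc_id": "D1", "family_id": "F1"},  # email
    {"doc_id": "D2", "family_id": "F1"},  # attachment to D1
    {"doc_id": "D3", "family_id": "F1"},  # attachment to D1
    {"doc_id": "D4", "family_id": "F2"},  # standalone file
    {"doc_id": "D5", "family_id": "F3"},  # email
    {"doc_id": "D6", "family_id": "F3"},  # attachment to D5
]

def sample_documents(docs, n, seed=None):
    """For 'what percentage of documents...': each document is equally likely."""
    return random.Random(seed).sample(docs, n)

def sample_families(docs, n, seed=None):
    """For 'what percentage of families...': each family is equally likely."""
    family_ids = sorted({d["family_id"] for d in docs})
    return random.Random(seed).sample(family_ids, n)

def family_context(docs, sampled_doc):
    """Non-sampled family members the reviewer should read for context."""
    return [d for d in docs
            if d["family_id"] == sampled_doc["family_id"]
            and d["doc_id"] != sampled_doc["doc_id"]]

doc_sample = sample_documents(documents, 2, seed=1)
family_sample = sample_families(documents, 2, seed=1)

for doc in doc_sample:
    context_ids = [d["doc_id"] for d in family_context(documents, doc)]
    print(doc["doc_id"], "review together with", context_ids)
```

Note that `family_context` returns documents for the reviewer to read, not additional sample observations; only the documents in `doc_sample` count toward the estimate.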

3.3. Recognize When Your Standards Change

Practitioners understand that there is not always perfect certainty for attributes such as document responsiveness. Some calls are close calls. This does not, in itself, undermine the validity of statistical sampling, as long as the calls are being made under a consistent standard.

It is possible, however, that the actual standards for responsiveness can change in the course of a review. This change in standards might be based on information and observations garnered in the early stages of the review. If this is the case, then it would not be sound to use a sample based on one set of standards to estimate proportions under different standards.

3.4. Simple Random Sampling vs. Stratified Sampling: Consider Sub-Populations

Continuing with the foregoing example, after culling from 1,000,000 to 100,000, and then making sure that family relationships are known in that 100,000, statistical sampling can then be validly applied to the new population universe of 100,000 documents/files.

“Simple random sampling” is sampling in which each member of the population has an equal chance of being observed. Simple random sampling can be used to estimate proportions of the population (e.g., proportion responsive) in this case, using the techniques set forth in Sections 2 and 4.
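As a minimal sketch of drawing a simple random sample, Python's standard `random.sample` selects without replacement so that every member of the population is equally likely to be chosen. The integer document IDs, the function name `simple_random_sample`, and the seed value are assumptions made for illustration; the sample size of 400 and population of 100,000 follow the running example.

```python
import random

def simple_random_sample(population_ids, n, seed=None):
    """Draw n distinct IDs; each member has an equal chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population_ids, n)

# Hypothetical IDs for the culled population of 100,000 documents.
population_ids = list(range(100_000))
sample = simple_random_sample(population_ids, 400, seed=42)

print(len(sample), len(set(sample)))  # 400 400
```

Recording the seed alongside the sample is one way to let the selection be reproduced later if the sampling method is questioned.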

It is possible that there are readily identifiable sub-populations within the population. Judgmental sampling can be used to determine sub-populations of interest. A common example in e-discovery is sub-populations based on custodian. Another example might be sub-populations based on file type.

If this is the case, it may make more sense, in terms of cost and in terms of matters of interest, to sample separately from within the sub-populations. This is known as “stratified sampling”.
Stratified sampling is not necessary if the only goal of the sampling is to estimate some attribute of the full population, but it would be necessary if the goal is to sample that attribute for each sub-group.
Depending on the facts of each case, it may be concluded that stratified sampling is more cost efficient than simple random sampling. But it might also be concluded that simple random sampling is more cost efficient.
(Further discussion of stratified sampling is outside the scope of this document.)
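For orientation only, sampling separately within sub-populations might look like the sketch below, using custodian as the stratum. The function name `stratified_sample`, the field names, and the choice of an equal fixed sample size per stratum are assumptions; the text does not prescribe how sample sizes should be allocated across strata.

```python
import random
from collections import defaultdict

def stratified_sample(docs, key, n_per_stratum, seed=None):
    """Group docs by `key`, then draw a simple random sample within each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[doc[key]].append(doc)
    # A stratum smaller than n_per_stratum is taken in full.
    return {
        name: rng.sample(members, min(n_per_stratum, len(members)))
        for name, members in sorted(strata.items())
    }

# Hypothetical population with two custodians of very different sizes.
docs = (
    [{"doc_id": f"A{i}", "custodian": "alice"} for i in range(500)]
    + [{"doc_id": f"B{i}", "custodian": "bob"} for i in range(50)]
)
samples = stratified_sample(docs, "custodian", 100, seed=7)

for custodian, sample in samples.items():
    print(custodian, len(sample))
```

Each per-custodian sample supports an estimate for that custodian alone; combining them into an estimate for the full population would require weighting by stratum size, which this sketch does not attempt.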

3.5. Quality Control

Quality control (“QC”) review is an area where it is very appropriate to use statistical sampling. Whether the initial coding of a set of documents was done by humans or by machines, it is generally agreed that the QC review process is making a simple, binary evaluation of whether the initial coding is correct or incorrect. Another value of statistical sampling in this case is that QC review is inherently not intended to re-examine every document – the process necessarily involves sampling of some sort, and statistical techniques can provide the quantitative guidance.
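Because the QC evaluation is binary (the initial coding is either correct or incorrect), a QC sample yields an estimated error proportion. The sketch below uses the textbook normal approximation for a proportion with z = 1.96 for 95% confidence; this is an assumption and may differ in detail from the techniques in Sections 2 and 4. The function name `qc_error_rate` and the example counts are hypothetical.

```python
import math

def qc_error_rate(qc_results, z=1.96):
    """Estimate the coding error rate from a QC sample.

    qc_results: list of booleans, True where the QC reviewer found the
    initial coding to be incorrect.
    Returns (rate, lower bound, upper bound), bounds clipped to [0, 1].
    """
    n = len(qc_results)
    p = sum(qc_results) / n
    margin = z * math.sqrt(p * (1 - p) / n)  # normal approximation
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical QC sample: 400 documents re-reviewed, 20 found miscoded.
qc_results = [True] * 20 + [False] * 380
rate, low, high = qc_error_rate(qc_results)

print(f"error rate {rate:.3f}, interval {low:.3f} to {high:.3f}")
```

The normal approximation becomes unreliable when the observed error rate is very close to zero or the sample is small, which is one reason alternative techniques may be preferred in QC settings.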

QC review has not always been considered a standard procedure, especially when the initial coding was performed by humans. The advent of machine coding has increased the recognition that QC is a vital part of the e-discovery process.

The current sense is that, when statistical sampling is actually employed in e-discovery review contexts, it tends to use the techniques for estimation of population proportion, as described in Sections 2 and 4. However, many industrial and commercial QC applications are understood to use a somewhat different technique known as acceptance testing. We anticipate future EDRM research in this area.

3.6. Use of Sampling in Machine Learning

A full discussion of machine learning is outside the scope of this document. Certainly, the mathematical/statistical algorithms that are coded within the software are outside the scope of this document. However, there are some observations that may be useful in the machine learning context.

The general technique for machine learning is that a sample of documents is coded by experienced people/experts. These documents then become the “seed” set that the machine uses for learning.
It is recognized that it is important to the process that this sample be unbiased and random.
Some of the issues previously referenced can emerge in this context. For example, the considerations regarding initial culling are applicable in developing the population for machine review.