Section 4 – Additional Guidance on Statistical Theory

April 27, 2012

It is not the intent of this document to cover all the requisite statistical theory at the level of the underlying formulas.

The amount of explanation that would be necessary to provide a “non-math” audience with a correct understanding is extensive, and would not necessarily be of interest to most members of that audience.

The target audience for this section is mainly those who are working in e-discovery, and who already have some interest and experience with math at the college level. These could be people in any number of e-discovery roles, who have decided, or who have been called upon, to refresh and enhance their skills in this area.

Thus, this section is written from the perspective of guiding this target audience.

4.1. Foundational Math Concepts

The three main probability distributions that should be understood are the binomial distribution, the hypergeometric distribution, and the normal distribution. These are covered in standard college textbooks on probability and statistics. Wikipedia has articles on all of these, although, of course, Wikipedia must be used with caution.

4.2. Binomial Distribution

The binomial distribution models what can happen if there are n independent trials of a process, each trial can have only two outcomes, and the probability of success is the same for every trial.

  • Mathematicians refer generally to the outcomes as “success” or “failure”. Depending on the context, the two possible outcomes might specifically be “yes” or “no”; or 1 or 0; or “heads” or “tails”. In the e-discovery context, a document can be “responsive” or “not responsive”.
  • The probability of success for each trial is p, which must be between 0 and 1. The corresponding probability of failure is (1-p).
  • Of the three distributions discussed here, this is conceptually the easiest model.
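
As an illustration of the binomial model described above, the following is a minimal Python sketch, not a definitive implementation. The function names and the example figures (a sample of 10 documents, each responsive with probability 0.3) are assumptions chosen for illustration and do not come from this document.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """Probability of at most k successes in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Hypothetical example: 10 documents, each responsive with p = 0.3.
n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(probs):
    print(f"P(exactly {k} responsive) = {prob:.4f}")

print(f"P(at most 3 responsive) = {binomial_cdf(3, n, p):.4f}")
```

The probabilities over all possible counts (0 through n) sum to 1, which is a quick sanity check on any such calculation.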

4.3. Hypergeometric Distribution

The hypergeometric distribution models what can happen if there are n trials of a process, and each trial can only have two outcomes, but the trials are drawn from a finite population. They are drawn from this population “without replacement”, meaning that they are not returned to the population and thus cannot be selected again.

  • Even if the population initially contains a certain proportion, p, of successes, the selection of the first trial alters that proportion in the remaining population. (If the first trial is a failure, the proportion of successes in the remaining population becomes slightly higher than p, and vice versa.)
  • The main reason for mentioning the hypergeometric is that it is the model that most accurately describes the typical e-discovery example: the population is finite, and sampling is done without replacement.
  • The math for the hypergeometric is more complicated than for the binomial, and it is not supported as well in tools such as Excel.
  • The larger the population, the closer the results will get to the binomial (which can be thought of as the extreme case of an infinite population).
  • For most practical purposes, when attempting to estimate population proportions, the binomial is acceptable as an approximation to the hypergeometric. All other things being equal, the binomial results will show slightly bigger confidence intervals or slightly lower confidence levels. Thus, it is conservative to use the binomial math where the actual process is hypergeometric.
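
The relationship between the two distributions can be checked numerically. The sketch below is illustrative only: the function names, the sample size of 10, the proportion of 0.3, and the population sizes are all assumptions, not figures from this document. It compares the hypergeometric and binomial probabilities as the population grows, and also compares their spread for a small population.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k successes when n items are drawn without
    replacement from a population of N items containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def variance(pmf_values):
    """Variance of a distribution given as a list indexed by outcome k."""
    mean = sum(k * q for k, q in enumerate(pmf_values))
    return sum((k - mean) ** 2 * q for k, q in enumerate(pmf_values))


def max_gap(N, p=0.3, n=10):
    """Largest absolute difference between the two models' probabilities."""
    K = round(N * p)  # number of successes in the finite population
    return max(
        abs(hypergeom_pmf(k, N, K, n) - binomial_pmf(k, n, p))
        for k in range(n + 1)
    )


# Hypothetical population sizes, each with 30% successes.
gaps = {N: max_gap(N) for N in (100, 1_000, 100_000)}
for N, gap in gaps.items():
    print(f"population {N:>7}: largest difference = {gap:.6f}")

# Spread of the two models for the smallest population.
hyper_var = variance([hypergeom_pmf(k, 100, 30, 10) for k in range(11)])
binom_var = variance([binomial_pmf(k, 10, 0.3) for k in range(11)])
print(f"variance: hypergeometric {hyper_var:.4f}, binomial {binom_var:.4f}")
```

The differences shrink as the population grows, and the binomial shows the larger variance, which is consistent with describing the binomial as the conservative choice.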

4.4. Normal Distribution

The normal distribution is the familiar “bell curve”. It is more abstract than the binomial and hypergeometric distributions, which describe binary outcomes: it is a continuous distribution, and its formula involves calculus and the constants e and π. However, it has some very useful characteristics.

  • Mathematicians have proven that if you were to take repeated random samples of a given, sufficiently large size from almost any population (technically, any population with finite variance), the sample averages would be distributed approximately according to the normal distribution. (If you want to know more about this, research the “central limit theorem”.)
  • Especially in the days before computers, the normal distribution was much easier to work with than the binomial. The routine process would be to use the normal approximations.
  • Even today, when attempting to calculate margin of error or sample size, if your tool is Excel, it is easier to use the normal approximation than the binomial or the hypergeometric.
  • This is the source of the familiar rules, such as the rule that you can get a 95% confidence level and a 5% margin of error with a sample of about 400. (The normal-approximation formula gives 385, which is commonly rounded up to 400.)
  • This approximation improves as the sample size increases, and works better when the proportion is towards the middle (.50) and not the extremes (0 or 1). If working on a problem where the probabilities are at the extremes, it is better to use the binomial or the hypergeometric.[1. The literature contains various guidelines for when the normal approximation is satisfactory. As examples,
    • np(1-p) > 9. Brownlee, K. A. Statistical Theory and Methodology in Science and Engineering. John Wiley & Sons, Inc. New York, N.Y. (1965), p. 140.
    • np(1-p) ≥ 10. Ross, Sheldon M. Introduction to Probability Models. Academic Press. San Diego (1997), p. 75.
    • np ≥ 5 AND n(1-p) ≥ 5. Finkelstein, Michael O. and Levin, Bruce. Statistics for Lawyers. Springer-Verlag. New York (1990), p. 121.

    Of course, multiple guidelines also appear in Wikipedia.]
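
The familiar sample-size rule can be reproduced with the standard normal-approximation formula n = z² p(1-p) / E², where z is the normal quantile for the chosen confidence level and E is the margin of error. The following is a minimal sketch in Python; the function names are hypothetical, and the rule-of-thumb check implements only the third guideline cited above (np ≥ 5 and n(1-p) ≥ 5).

```python
from math import ceil
from statistics import NormalDist


def z_value(confidence):
    """Two-sided standard normal quantile, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def sample_size(confidence, margin, p=0.5):
    """Sample size from the normal approximation, assuming an infinite
    population; p = 0.5 is the conservative default."""
    z = z_value(confidence)
    return ceil(z * z * p * (1 - p) / margin**2)


def normal_approx_ok(n, p):
    """One published rule of thumb: np >= 5 and n(1-p) >= 5."""
    return n * p >= 5 and n * (1 - p) >= 5


print(sample_size(0.95, 0.05))      # 385
print(sample_size(0.95, 0.02))      # 2401
print(normal_approx_ok(385, 0.50))  # True
print(normal_approx_ok(385, 0.01))  # False: proportion too extreme
```

The last line illustrates the caution above: with a sample of 385 and a proportion of 0.01, the expected number of successes is under 5, so the binomial or hypergeometric would be the better choice.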

4.5. The More Information You Have, the More Precise You Can Be

The most basic approach is to solve for confidence level, margin of error, or sample size in terms of the other two. When this is done, the math makes the conservative assumptions that (1) the proportion of successes is 0.50, and (2) the underlying population size is infinite.

Greater precision is possible if the actual proportion of successes and/or the size of the finite population are known. The Excel examples help to demonstrate this. The tradeoff is that this requires more intricate math. Thus, over the course of a project, one can start with conservative standard guidelines and evolve toward a more precise picture as more is known.
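
To illustrate how additional information reduces the required sample size, the sketch below extends the normal-approximation formula with a known proportion and a finite population. It is not the calculation used in this document's Excel examples. The finite population adjustment shown, n = n0 / (1 + (n0 - 1) / N), is a standard textbook correction brought in here as an assumption, and the function name, population size, and proportion are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def refined_sample_size(confidence, margin, p=0.5, population=None):
    """Sample size under the normal approximation.

    p          -- assumed proportion of successes (0.5 is most conservative)
    population -- finite population size, or None to treat it as infinite
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n0 = z * z * p * (1 - p) / margin**2  # infinite-population size
    if population is not None:
        # Textbook finite population correction (an assumption here).
        n0 = n0 / (1 + (n0 - 1) / population)
    return ceil(n0)


baseline = refined_sample_size(0.95, 0.05)
known_p = refined_sample_size(0.95, 0.05, p=0.10)
finite = refined_sample_size(0.95, 0.05, population=10_000)
both = refined_sample_size(0.95, 0.05, p=0.10, population=10_000)

print("conservative defaults:     ", baseline)
print("proportion known to be 0.10:", known_p)
print("population of 10,000:       ", finite)
print("both known:                 ", both)
```

Each piece of added information lowers the required sample size relative to the conservative defaults, at the cost of a slightly more intricate calculation. When the proportion is near 0 or 1, the earlier caution about the normal approximation still applies.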

4.6. Non-Binary Cases

This document has repeatedly stressed that the calculations assume only two possible outcomes. This does not mean that statistical sampling is useless where there are more than two possible outcomes. It only means that the math becomes more complicated, and the required sample sizes may increase (for the same confidence levels and margins of error).

Therefore, it is not the intent of this document to discourage approaches that involve more than two possible outcomes, but only to advise that the results here cannot be applied in those situations.

Further research in this area is anticipated.