Predictive Ranking: Technology Assisted Review Designed for the Real World

An EDRM White Paper

By Jeremy Pickens, Senior Applied Research Scientist, Catalyst Repository Systems | February 1, 2013

Why Predictive Ranking?

Most articles about technology assisted review (TAR) start with dire warnings about the explosion in electronic data. In most legal matters, however, the reality is that the quantity of data is big, but it is no explosion. Even a half million documents (a relatively small number in comparison to the “big data” of the web) pose a serious challenge to a review team. That is a lot of documents, and reviewing them can cost a lot of money, especially if you have to go through them in a manual, linear fashion. Catalyst’s Predictive Ranking bypasses that linearity, helping you zero in on the documents that matter most. But that is only part of what it does.

In the real world of e-discovery search and review, the challenges lawyers face come not merely from the explosion of data, but also from the constraints imposed by rolling collection, immediate deadlines, and non-standardized (and at times confusing) validation procedures. Overcoming these challenges is as much about process and workflow as it is about the technology that can be specifically crafted to enable that workflow. For these real-world challenges, Catalyst’s Predictive Ranking provides solutions that no other TAR process can offer.

In this article, we will give an overview of Catalyst’s Predictive Ranking and discuss how it differs from other TAR systems in its ability to respond to the dynamics of real-world litigation. But first, we will start with an overview of the TAR process and discuss some concepts that are key to understanding how it works.

What is Predictive Ranking?

Predictive Ranking is Catalyst’s proprietary TAR process. We developed it more than four years ago and have continued to refine and improve it ever since. It is the process used in our newly released product, Insight Predict.

In general, all the various forms of TAR share common denominators: machine learning, sampling, subjective coding of documents, and refinement. But at the end of the day, the basic concept of TAR is simple, in that it must accomplish only two essential tasks:

  1. Finding all (or “proportionally all”) responsive documents.
  2. Verifying that all (or “proportionally all”) responsive documents have been found.

That is it. For short, let us call these two goals “finding” and “validating.”

Finding Responsive Documents

Finding consists of two parts:

  1. Locating and selecting documents to label. By label, we mean manually mark them as responsive or nonresponsive.
  2. Propagating (via an algorithmic inference engine) these labels onto unseen documents.

This process of finding or searching for responsive documents is typically evaluated using two quantitative measures: precision and recall. Precision is the number of true hits (actually responsive documents) returned by the search divided by the total number of hits returned. Recall is the number of true hits returned by the search divided by the actual number of true hits in the population.
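As a minimal illustration of these two measures, the following Python sketch computes both from sets of document IDs. The function name and the example IDs are hypothetical and are not drawn from any Catalyst product.

```python
def precision_recall(retrieved, responsive):
    """Return (precision, recall) for a set of retrieved document IDs.

    retrieved:  IDs returned by the search (the "hits").
    responsive: IDs of all truly responsive documents in the population.
    """
    retrieved, responsive = set(retrieved), set(responsive)
    true_hits = len(retrieved & responsive)
    precision = true_hits / len(retrieved) if retrieved else 0.0
    recall = true_hits / len(responsive) if responsive else 0.0
    return precision, recall


# Hypothetical example: 8 documents returned, 6 of them responsive,
# out of 10 responsive documents in the whole population.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}
responsive = {1, 2, 3, 4, 5, 6, 11, 12, 13, 14}
p, r = precision_recall(retrieved, responsive)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

A search can score well on one measure and poorly on the other, which is why both are reported together.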

One area of contention and disagreement among vendors is step 1, the sampling procedures used to train the algorithm in step 2. Vendors’ philosophies generally fall into one of two camps, which loosely can be described as judgmentalists and randomists.

The judgmentalist approach assumes that litigation counsel (or the review manager) has the most insightful knowledge about the domain and matter and is therefore going to be the most effective at choosing training documents. The randomist approach, on the other hand, is concerned about bias. Expertise can help the system quickly find certain pockets of responsive information, the randomists concede, but the problem they see is that even experts do not know what they do not know. By focusing the attention of the system on some documents and not others, the judgmental approach potentially ignores large swaths of responsive information even while it does exceptionally well at finding others.

Therefore, the random approach samples every document in the collection with equal probability. This even-handed approach mitigates the problem of human bias and ensures that a wide set of starting points are selected. However, there is still no guarantee that a simple random sample will find those known pockets of responsive information about which the human assessor has more intimate knowledge.

At Catalyst, we recognize merits in both approaches. An ideal process would be one that combines the strengths of each to overcome the weakness of the other. One straightforward solution is to take the “more is more” approach and do both judgmental and random sampling. A combined sample not only has the advantage of human expertise, but also avoids some of the issues of bias.

However, while it is important to avoid bias, simple random sampling misses the point. Random sampling is good for estimating counts; it does not do as well at guaranteeing topical coverage (that is, at reaching every pocket of responsive information). The best way to avoid bias is not to pick random documents, but to select documents about which you know that you know very little. Let’s call it “diverse topical coverage.”

Remember the difference between the two goals: finding vs. validating. For validation, a statistically valid random sample is required. But for finding, we can be more intelligent than that. We can use intelligent algorithms to explicitly detect which documents we know the least about, no matter which other documents we already know something about. This is more than just simple random sampling, which offers no guarantee of topically covering a collection. This is using algorithms to explicitly seek out those documents about which we know nothing or next to nothing. The Catalyst approach is therefore to not stand in the way of our clients by shoehorning them into a single sampling regimen for the purpose of finding. Rather, our clients may pick whatever documents they want to judge, for whatever reason, and “contextual diversity sampling” will detect any imbalances and help select the rest.
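The general idea of seeking out the documents we know least about can be sketched in a few lines of Python. This is not Catalyst’s contextual diversity algorithm, which is proprietary and not described here. It is a simple stand-in that assumes each document is represented as a set of terms and uses Jaccard similarity to pick the unjudged document that least resembles anything already judged. The function names and the sample documents are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def least_known(docs, judged_ids):
    """Return the ID of the unjudged document least similar to anything judged.

    docs:       dict mapping document ID -> set of terms (assumed format).
    judged_ids: IDs a reviewer has already coded, chosen by any means.
    Returns None when every document has already been judged.
    """
    judged = [docs[i] for i in judged_ids]

    def best_match(doc_id):
        # How well the judged set already "covers" this document.
        return max((jaccard(docs[doc_id], j) for j in judged), default=0.0)

    candidates = [i for i in docs if i not in judged_ids]
    return min(candidates, key=best_match, default=None)


# Hypothetical collection: three documents about a merger, one unrelated.
docs = {
    "a": {"merger", "price", "board"},
    "b": {"merger", "price", "offer"},
    "c": {"lunch", "friday", "pizza"},
    "d": {"merger", "board", "vote"},
}
print(least_known(docs, {"a"}))  # c
```

Whatever the reviewer chose to judge first (here, document "a"), the selection step steers the next judgment toward the part of the collection that the existing judgments do not cover.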

Examples of Finding

The following examples illustrate the performance of Catalyst’s intelligent algorithms with respect to the various points that were made in the previous section about random, judgmental, and contextual diversity sampling. In each of these examples, the horizontal x-axis represents the percentage of the collection that must be reviewed in order to find (on the y-axis) the given recall level using Catalyst’s Predictive Ranking algorithms.

For example, in this first graph we have a Predictive Ranking task with a significant number of responsive documents, a high richness. There are two lines, each representing a different initial seed condition: random versus judgmental. The first thing to note is that judgmental sampling starts slightly ahead of random sampling. The difference is not huge; the judgmental approach finds perhaps 2-3% more documents initially. That is to be expected, because the whole point of judgmental sampling is that the human can use his or her intelligence and insight into the case or domain to find documents that the computer is not capable of finding by strictly random sampling.

That brings us to the concern that judgmental sampling is biased and will not allow TAR algorithms to find all the documents. However, this chart shows that by using Catalyst’s intelligent iterative Predictive Ranking algorithms, both the judgmental and random initial sampling get to the same place. They both get about 80% of the available responsive documents after reviewing only 6% of the collection, 90% after reviewing about 12% of the collection, and so forth. Initial differences and biases are swallowed up by Catalyst’s intelligent Predictive Ranking algorithms.
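For readers who want to reproduce this kind of curve on their own data, the following Python sketch shows how a point on it can be computed: given a ranked list of documents with known labels, it returns the fraction of the collection that must be reviewed to reach a target recall. The function name and the toy ranking are hypothetical; the figures quoted above come from the graphs, not from this code.

```python
import math


def depth_for_recall(ranked_labels, target_recall):
    """Fraction of a ranked collection that must be reviewed to reach target recall.

    ranked_labels: list of booleans in ranked order (True = responsive),
                   best-ranked document first.
    """
    total = sum(ranked_labels)
    if total == 0:
        raise ValueError("no responsive documents in the collection")
    needed = math.ceil(target_recall * total)
    found = 0
    for position, is_responsive in enumerate(ranked_labels, start=1):
        found += is_responsive
        if found >= needed:
            return position / len(ranked_labels)
    return 1.0


# Toy ranking of 20 documents; the 5 responsive ones sit at ranks 1, 2, 3, 5, 12.
labels = [True, True, True, False, True] + [False] * 6 + [True] + [False] * 8
print(depth_for_recall(labels, 0.8))  # 0.25
print(depth_for_recall(labels, 1.0))  # 0.6
```

Plotting this value for a range of recall targets gives a curve of the kind described above, with review depth on the x-axis and recall on the y-axis.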

In the second graph, we have a different matter in which the number of available responsive documents is over an order of magnitude less than in the previous example; the collection is very sparse. In this case, random sampling is not enough. A random sample does not find any responsive documents, so nothing can be learned by any algorithm. However, the judgmental sample does find a number of responsive documents, and even with this sparse matter, 85% of the available responsive documents may be found by only examining a little more than 6% of the collection.
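A short calculation shows why a random seed set can come up empty in a sparse collection. The Python sketch below assumes the collection is large relative to the sample, so that each draw can be treated as independent. The richness values and sample size are hypothetical, since the matter’s actual figures are not given here.

```python
def prob_no_responsive(richness, sample_size):
    """Probability that a simple random sample contains zero responsive documents.

    Assumes the collection is large relative to the sample, so draws are
    treated as independent (a binomial approximation).
    """
    return (1.0 - richness) ** sample_size


# Hypothetical figures: a 500-document random sample at two richness levels.
for richness in (0.05, 0.001):
    miss = prob_no_responsive(richness, 500)
    print(f"richness={richness:.3f}  P(no responsive in sample)={miss:.4f}")
```

At 5% richness the chance of an empty sample is negligible. At 0.1% richness it is roughly 60%, in which case the learning algorithm would have no positive examples to work from.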

However, a different story emerges when the user chooses to switch on contextual diversity sampling as part of the algorithmic learning process. In the previous example, contextual diversity was not needed. In this case, especially with the failure of the random sampling approach, it is. The following graph shows the results of both random sampling and judgmental sampling with contextual diversity activated, alongside the original results with no contextual diversity:

Adding contextual diversity to the judgmental seed has the effect of slowing learning in the initial phases. However, after only about 3.5% of the way through the collection, it catches up to the judgmental-only approach and even surpasses it. A 95% recall may be achieved a little less than 8% of the way through the collection. The results for adding contextual diversity to the random sampling are even more striking. It also catches up to judgmental sampling about 4% of the way through the collection and also surpasses it by the end, ending up at just over 90% recall a little less than 8% of the way through the collection.

These examples serve two primary purposes. First, they demonstrate that Catalyst’s iterative Predictive Ranking algorithms work, and work well. The vast majority of a collection does not need to be reviewed, because the Predictive Ranking algorithm finds 85%, 90%, 95% of all available responsive documents within only a few percent of the entire collection.

Second, these examples demonstrate that, no matter how you start, you will attain that good result. It is this second point that bears repeating and further consideration. Real-world e-discovery is messy. Collection is rolling. Deadlines are imminent. Experts are not always available when you need them to be available. It is not always feasible to start a TAR project in the clean, perfect, step-by-step manner that a vendor might require. Knowing that one can instead start either with judgmental samples or with random samples, and that the ability to add a contextual diversity option ensures that early shortcomings are not only mitigated but exceeded, is of critical importance to a TAR project.

Validating What You Have Found

Validating is an essential step in ensuring legal defensibility. There are multiple ways of doing it. Yes, there needs to be random sampling. Yes, the sample needs to be statistically valid. But there are different ways of structuring the random samples. The most common method is to do a simple random sample of the collection as a whole, and then another simple random sample of the documents that the machine has labeled as nonresponsive. If the richness of responsive documents in the latter sample has significantly decreased from the responsive-document richness in the initial whole population, then the process is considered to be valid.
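The before-and-after comparison in this common method can be sketched as follows. The Python code estimates richness from a simple random sample using a normal-approximation confidence interval, which is one textbook choice among several and an assumption of this sketch. The sample counts are hypothetical, and treating non-overlapping intervals as a “significant decrease” is an illustrative rule, not a legal standard.

```python
import math


def richness_estimate(responsive_in_sample, sample_size, z=1.96):
    """Point estimate and normal-approximation interval for richness.

    z=1.96 corresponds to roughly 95% confidence. The normal approximation
    is an assumption and is poor for very small counts.
    Returns (estimate, lower_bound, upper_bound).
    """
    p = responsive_in_sample / sample_size
    margin = z * math.sqrt(p * (1.0 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical samples of 1,500 documents each.
before = richness_estimate(150, 1500)  # whole collection
after = richness_estimate(15, 1500)    # documents labeled nonresponsive

# Illustrative rule: the upper bound afterwards is below the lower bound before.
decreased = after[2] < before[1]
print(f"before={before[0]:.3f} after={after[0]:.3f} decreased={decreased}")
```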

However, at Catalyst we use a different procedure, one that we think is better at validating results. Like other methods, it also relies on random sampling. However, instead of doing a simple random sample of a set of documents, we use a systematic random sample of a ranking of documents. Instead of labeling documents first and sampling for richness second, the Catalyst procedure ranks all documents by their likelihood of being responsive. Only then is a random sample—a systematic random sample—taken.

At equal intervals across the entire list, samples are drawn. This gives Catalyst the ability to better estimate the concentration of responsive documents at every point in the list than an approach based on unordered simple random sampling. With this better estimate, a smarter decision boundary can be drawn between the responsive and nonresponsive documents. In addition, because the documents on either side of that boundary have already been systematically sampled, there is no need for a two-stage sampling procedure.
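Drawing a systematic random sample from a ranking is straightforward to sketch. The Python code below shows the textbook form of the technique: a random start within the first interval, followed by every k-th document. It is offered as an illustration under that assumption, not as Catalyst’s implementation, and the function name is hypothetical.

```python
import random


def systematic_sample(ranked_ids, n, rng=random):
    """Draw a systematic random sample of about n items from a ranked list.

    A random start is chosen within the first interval, then one document
    is taken per interval, so the sample is spread evenly over the ranking.
    Returns a list of (rank_position, document_id) pairs in ranked order.
    """
    k = len(ranked_ids) / n      # sampling interval (may be fractional)
    start = rng.random() * k     # random offset within the first interval
    positions = [int(start + i * k) for i in range(n)]
    return [(pos, ranked_ids[pos]) for pos in positions if pos < len(ranked_ids)]


# Hypothetical ranking of 1,000 document IDs, best-ranked first.
ranking = [f"doc-{i:04d}" for i in range(1000)]
sample = systematic_sample(ranking, 10, random.Random(2013))
for position, doc_id in sample:
    print(position, doc_id)
```

Because one document is drawn from each equal-sized slice of the ranking, coding the sample shows how the concentration of responsive documents changes from the top of the list to the bottom.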

Workflow: Putting Finding and Validating Together

In the previous section, we introduced the two primary tasks involved in TAR: finding and validation. If machines (and humans, for that matter) were perfect, there would be no need for these two stages. There would only be a need for a single stage. For example, if a machine algorithm were known to perfectly find every responsive document in the collection, there would be no need to validate the algorithm’s output. And if a validation process could perfectly detect when all documents are correctly labeled, there would be no need to use an algorithm to find all the responsive ones; all possible configurations (combinatorial issues aside) could be tested until the correct one is found.

But no perfect solutions exist for either task, nor will they in the future. Thus, the reason for having a two-stage TAR process is so that each stage can provide checks and balances to the other. Validation ensures that finding is working, and finding ensures that validation will succeed.

Therefore, TAR requires some combination of both tasks. The manner in which both finding and validation are symbiotically combined is known as the e-discovery “workflow.” Workflow is a non-standard process that varies from vendor to vendor. For the most part, every vendor’s technology combines these tasks in a way that, ultimately, is defensible. However, defensibility is the minimum bar that must be cleared.

Some combinations might work more efficiently than others. Some combinations might work more effectively than others. And some workflows allow for more flexibility to meet the challenges of real world e-discovery, such as rolling collection.

We’ll discuss a standard model, typical of the industry, then review Catalyst’s approach, and finally conclude with the reason Catalyst’s approach is better. Hint: It’s not (only) about effectiveness, although we will show that it is that. Rather, it is about flexibility, which is crucial in the work environments in which lawyers and review teams use this technology.

Standard TAR Workflow

Most TAR technologies follow the same essential workflow. As we will explain, this standard workflow suffers from two weaknesses when applied in the context of real-world litigation. Here are the steps it entails:

  1. Estimate via simple random sampling how many responsive and nonresponsive docs there are in the collection (aka estimate whole population richness).
  2. Sample (and manually, subjectively code) documents.
  3. Feed those documents to a predictive coding engine to label the remainder of the collection.
  4. If manual intervention is needed to assist in the labeling (for example via threshold or rank-cutoff setting), do so at this point.
  5. Estimate via simple random sampling how many responsive documents there are in the set of documents that have been labeled in steps 3 and 4 as nonresponsive.
  6. Compare the estimate in step 5 with the estimate in step 1. If there has been a significant decrease in responsive richness, then the process as a whole is valid.

TAR as a whole relies on these six steps working as a harmonious process. However, each step is not done for the same reason. Steps 2-4 are for the purpose of finding and labeling. Steps 1, 5, and 6 are for the purpose of validation.

The first potential weakness in this standard workflow stems from the fact that the validation step is split into two parts, one at the very beginning and one at the very end. It is the relative comparison between the beginning and the end that gives this simple random-sampling-based workflow its validity. However, that also means that in order to establish validity, no new documents may arrive at any point after the workflow has started. Collection must be finished.

In real-world settings, collection is rarely complete at the outset. If new documents arrive after the whole-population richness estimate (step 1) is already done, then that estimate will no longer be statistically valid. And if that initial estimate is no longer valid, then the final estimates (step 5), which compare themselves to that initial estimate, will also not be valid. Thus, the process falls apart.
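A small worked example makes the problem concrete. The Python sketch below computes the true richness of a collection assembled from several batches. The batch sizes and richness values are hypothetical.

```python
def combined_richness(batches):
    """True richness of a collection assembled from (size, richness) batches."""
    total_docs = sum(size for size, _ in batches)
    total_responsive = sum(size * richness for size, richness in batches)
    return total_responsive / total_docs


# Hypothetical rolling collection: the step 1 estimate was made on the first
# batch only, and a second, much sparser batch arrived afterwards.
initial = [(100_000, 0.10)]
rolled = initial + [(50_000, 0.02)]
print(f"richness at step 1:       {combined_richness(initial):.3f}")
print(f"richness after new batch: {combined_richness(rolled):.3f}")
```

In this example the collection’s richness falls from 10% to about 7.3% simply because new documents arrived. A step 5 estimate compared against the original 10% baseline would therefore overstate how much of the decrease is due to the review.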

The second potential weakness in the standard workflow is that the manual intervention for threshold setting (step 4) occurs before the second (and final) random sampling (step 5). This is crucial to the manner in which the standard workflow operates. In order to compare before and after richness estimates (step 1 vs. step 5), concrete decisions will have had to be made about labels and decision boundaries. But in real-world settings, it may be premature to make concrete decisions at this point in the overall review.

How Catalyst’s Workflow Differs

In order to circumvent these weaknesses and match our process more closely to real-world litigation, Catalyst’s Predictive Ranking uses a proprietary, four-step workflow:

  1. Sample (and manually, subjectively code) documents.
  2. Feed those documents to our Predictive Ranking engine to rank the remainder of the collection.
  3. Estimate via a systematic random sample the relative concentration of responsive documents throughout the ranking created in step 2.
  4. Based on the concentration estimate from step 3, select a threshold or rank-cutoff setting which gives the desired recall and/or precision.
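Steps 3 and 4 can be illustrated with a short Python sketch. It assumes a systematic sample has already been drawn at equal intervals across the ranking and coded by a reviewer, and it picks the shallowest cutoff whose estimated recall meets a target. The function name and sample data are hypothetical, and the simple running estimates used here are an assumption of the sketch, not a description of Catalyst’s estimator.

```python
def cutoff_for_recall(sample, target_recall):
    """Pick a rank cutoff from a coded systematic sample.

    sample: list of (rank_position, is_responsive) pairs, sorted by rank,
            drawn at equal intervals across the whole ranking.
    Returns (cutoff_position, estimated_recall, estimated_precision).
    """
    total = sum(1 for _, responsive in sample if responsive)
    if total == 0:
        raise ValueError("sample contains no responsive documents")
    found = 0
    for seen, (position, responsive) in enumerate(sample, start=1):
        found += responsive
        recall = found / total
        if recall >= target_recall:
            # Each sampled document stands in for one equal-sized slice of
            # the ranking, so sample ratios estimate collection ratios.
            return position, recall, found / seen
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical coded sample: 10 documents at ranks 50, 150, ..., 950.
coded = [True, True, True, False, True, False, False, False, False, False]
sample = [(50 + 100 * i, label) for i, label in enumerate(coded)]
print(cutoff_for_recall(sample, 0.75))  # (250, 0.75, 1.0)
print(cutoff_for_recall(sample, 0.90))  # (450, 1.0, 0.8)
```

With a sample this small the estimates are coarse. A real sample would be much larger so that the recall and precision estimates at each candidate cutoff are statistically meaningful.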

Once again, as with the standard predictive coding workflow, our Predictive Ranking as a whole relies on these four steps working as a harmonious process. However, each step is not done for the same reason. Steps 1 and 2 are for the purpose of finding and labeling. Steps 3 and 4 are for the purpose of validation.

Two important points should be noted about Catalyst’s workflow. The first is that the validation step is not split into two parts. Validation only happens at the very end of the entire workflow. If more documents arrive while documents are being found and labeled during steps 1 and 2 (i.e. if collection is rolling), the addition of new documents does not interfere with anything critical to the validation of the process. (Additional documents might make finding more difficult; finding is a separate issue from validating, one which Catalyst’s contextual diversity sampling algorithms are designed to address.)

The fact that validation in our workflow is not hampered by collections that are fluid and dynamic is significant. In real-world e-discovery situations, rolling collection is the norm. Our ability to handle this fluidity natively—by which we mean central to the way the workflow normally works, rather than as a tacked-on exception—is highly valuable to lawyers and review teams.

The second important point to note about Catalyst’s workflow is that the manual intervention for threshold setting (step 4) happens after the systematic random sample. At first it may seem counterintuitive as to why this is defensible, because choices about the labeling of documents are happening after a random sample has been taken. But the purpose of the systematic random sample is to estimate concentrations in a statistically valid manner. Since the concentration estimates themselves are valid, decisions made based on those concentrations are also valid.

Consequences and Benefits of the Catalyst Workflow

We already touched on two key ways in which the Catalyst Predictive Ranking workflow differs from the industry standard workflow. It is important to understand what our workflow allows us, and you, to do:

  1. Get good results. Catalyst Predictive Ranking consistently demonstrates high scores for both precision and recall.
  2. Add more training samples, of any kind, at any time. That allows the flexibility of having judgmental samples without bias.
  3. Add more documents, of any kind, at any time. You don’t have to wait 158 days until all documents are collected. And you don’t have to repeat step 1 of the standard workflow when those additional documents arrive.
  4. Go through multiple stages of culling and filtering without hampering validation. In the standard workflow, that would destroy your baseline. This is not a concern with the Catalyst approach, which saves the validation to the very end, via the systematic sample.

Catalyst has more than four years of experience using Predictive Ranking techniques to target review and reduce document populations. Our algorithms are highly refined and highly effective. Even more important, however, is that our Predictive Ranking workflow has what other vendors’ workflows do not—the flexibility to accommodate real-world e-discovery. Out there in the trenches of litigation, e-discovery is a dynamic process. Whereas other vendors’ TAR workflows require a static collection, ours flows with the dynamics of your case.

About Jeremy Pickens

Jeremy Pickens, Ph.D., is one of the world’s leading search scientists and a pioneer in the field of collaborative exploratory search, a form of search in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has six patents pending in the field of search and information retrieval, including two for collaborative exploratory search systems.

At Catalyst, Dr. Pickens researches and develops methods of using collaborative search to achieve more intelligent and precise results in e-discovery search and review. He also studies other ways to enhance search and review within the Catalyst system.

Dr. Pickens earned his master’s and doctoral degrees at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King’s College, London, on a joint grant with Goldsmiths University of London. As part of the OMRAS project (Online Music Recognition and Searching), he helped organize the first Music Information Retrieval (ISMIR) conference in Plymouth, Mass. Before joining Catalyst, Dr. Pickens spent five years as a research scientist at FX Palo Alto Lab, where his major research themes included video search and collaborative exploratory search.

Dr. Pickens is co-author of the forthcoming book, A Taxonomy of Collaborative Information Seeking, to be published by Morgan & Claypool Publishers. He was an editor of the spring 2010 special issue on collaborative information seeking of the journal Information Processing and Management. He is a frequent author and speaker on the topic.

Disclaimer

Unless otherwise noted, all opinions expressed in the EDRM White Paper Series materials are those of the authors, of course, and not of EDRM, EDRM participants, the author’s employers, or anyone else.