[Editor’s Note: This analysis was originally published by Rob Robinson and ComplexDiscovery on February 16, 2023, and EDRM is grateful for his permission to republish.]
The story of Goldilocks and the Three Bears provides a helpful analogy for understanding the different models of Technology-Assisted Review (TAR) and the importance of selecting the approach that is “just right” for eDiscovery objectives, timelines, and resources. Like Goldilocks, eDiscovery professionals must carefully consider the most appropriate approach for their situation.
For instance, TAR 1.0, including Simple Active Learning and Simple Passive Learning, can be compared to Goldilocks’ selection of chairs. In TAR 1.0, a control set and seed set of electronic data are established and reviewed until a sufficient number of relevant documents are identified. Just as Goldilocks had to choose the chair that was “just right” for her, eDiscovery professionals need to determine which TAR 1.0 model best suits their case.
TAR 2.0, or Continuous Active Learning® (CAL®), can be compared to Goldilocks’ selection of the bed. Just as Goldilocks had to try out different beds to find the one that was “just right,” TAR 2.0 involves establishing a seed set of relevant documents and repeatedly refining the machine learning algorithm so that it suggests the documents most likely to be responsive. This iterative process allows eDiscovery professionals to identify relevant information efficiently and accurately.
TAR 3.0, or Cluster-Centric CAL®, can be compared to Goldilocks’ selection of the bowl of porridge. Just as Goldilocks had to try different bowls of porridge to find the one that was “just right” for her, eDiscovery professionals using TAR 3.0 must carefully select the relevant clusters of electronic data to ensure that the machine learning algorithm is appropriately applied to the most relevant information. This approach also eliminates the use of control sets.
Finally, TAR 4.0, or Hybrid Multimodal IST Predictive Coding, can be compared to Goldilocks’ overall experience in the Three Bears’ house. The process begins with defining the scope of discovery, relevance, and related review procedures through ESI communications, conducting Multimodal Early Case Assessment (ECA), and taking a random sample to determine the prevalence of relevant information. This comprehensive approach involves selecting the appropriate chair, bed, and bowl of porridge, just as Goldilocks did to ensure a positive experience in the Three Bears’ house.
Like Goldilocks, eDiscovery professionals must carefully weigh their objectives, timelines, and resources when selecting the appropriate model of TAR. By choosing the most suitable approach, they can efficiently and accurately manage large volumes of electronic data in the electronic discovery process while maintaining strong cybersecurity and information governance practices.
Post Script: It’s worth noting that in the story of Goldilocks and the Three Bears, the “just right” option was the one that worked best for Goldilocks’ specific preferences and needs. Similarly, while TAR 4.0, developed and championed by industry expert, practitioner, and author Ralph Losey, may be the most comprehensive and suitable approach for complex eDiscovery cases, it’s essential to remember that every selection will be unique to the specific cybersecurity, information governance, and legal discovery professionals involved. Losey also readily concedes that his approach requires a high level of skill and experience that not every practitioner or legal search expert has the time to attain. The right approach to TAR will always depend on the particular objectives, timelines, and resources of the case at hand, so it is essential to evaluate the available options thoroughly and select the approach that best fits the needs of the individual case.
General TAR Protocols (1,2,3,4,5,6,7)
Additionally, these technologies are generally employed as part of a TAR protocol, which determines how the technologies are used. Examples of TAR protocols include:
Listed in Alphabetical Order
- Continuous Active Learning® (CAL®): In CAL®, the TAR method developed, used, and advocated by Maura R. Grossman and Gordon V. Cormack, the learner, after the initial training set, repeatedly selects the next-most-likely-to-be-relevant documents (that have not yet been considered) for review, coding, and training, and continues to do so until it can no longer find any more relevant documents. There is generally no second review because, by the time the learner stops learning, all documents deemed relevant by the learner have already been identified and manually reviewed.
- Hybrid Multimodal Method: An approach developed by the e-Discovery Team (Ralph Losey) that includes all types of search methods, with primary reliance placed on predictive coding and the use of high-ranked documents for continuous active training.
- Scalable Continuous Active Learning (S-CAL): The essential difference between S-CAL and CAL® is that for S-CAL, only a finite sample of documents from each successive batch is selected for labeling, and the process continues until the collection—or a large random sample of the collection—is exhausted. Together, the finite samples form a stratified sample of the document population, from which a statistical estimate of ρ (the prevalence of relevant documents) may be derived.
- Simple Active Learning (SAL): In SAL methods, after the initial training set, the learner selects the documents to be reviewed and coded by the teacher, and used as training examples, and continues to select examples until it is sufficiently trained. Typically, the documents the learner chooses are those about which the learner is least certain, and therefore from which it will learn the most. Once sufficiently trained, the learner is then used to label every document in the collection. As with SPL, the documents labeled as relevant are generally re-reviewed manually.
- Simple Passive Learning (SPL): In simple passive learning (“SPL”) methods, the teacher (i.e., human operator) selects the documents to be used as training examples; the learner is trained using these examples, and once sufficiently trained, is used to label every document in the collection as relevant or non-relevant. Generally, the documents labeled as relevant by the learner are re-reviewed manually. This manual review represents a small fraction of the collection, and hence a small fraction of the time and cost of an exhaustive manual review.
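The protocol differences above come down to how the next documents for human review are chosen. The sketch below contrasts CAL®-style top-ranked selection with SAL-style uncertainty selection, using a toy term-overlap scorer in place of a real trained text classifier. All names (`toy_score`, `cal_review`, `sal_pick`), the seed-set handling, and the stopping rule are illustrative assumptions, not any vendor's or researcher's actual implementation.

```python
def toy_score(doc, model_terms):
    # Toy relevance scorer: fraction of the model's terms present in the
    # document. A real system would use a trained text classifier.
    if not model_terms:
        return 0.0
    return len(set(doc.split()) & model_terms) / len(model_terms)

def sal_pick(docs, model_terms, reviewed, k=1):
    # Simple Active Learning sketch: choose the unreviewed documents the
    # model is LEAST certain about (score closest to 0.5).
    candidates = [i for i in range(len(docs)) if i not in reviewed]
    return sorted(candidates,
                  key=lambda i: abs(toy_score(docs[i], model_terms) - 0.5))[:k]

def cal_review(docs, is_relevant, seed_idx, batch_size=2):
    # Continuous Active Learning sketch: repeatedly review the
    # highest-scoring unreviewed documents, fold each human judgment back
    # into the "model", and stop once a batch yields no new relevant docs.
    model_terms = set(docs[seed_idx].split())  # seed set stands in for training
    reviewed = {seed_idx}
    found = [seed_idx]
    while True:
        candidates = [i for i in range(len(docs)) if i not in reviewed]
        if not candidates:
            break
        # rank unreviewed documents, most-likely-relevant first
        candidates.sort(key=lambda i: toy_score(docs[i], model_terms),
                        reverse=True)
        hits = 0
        for i in candidates[:batch_size]:
            reviewed.add(i)
            if is_relevant(docs[i]):                 # the human coding decision
                found.append(i)
                model_terms |= set(docs[i].split())  # continuous training
                hits += 1
        if hits == 0:  # illustrative stopping rule, not a defensible one
            break
    return found, reviewed
```

The contrast is the point: SAL samples where the model is uncertain to accelerate training, while CAL® always pulls the top-ranked documents, so review and training are the same activity.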
TAR Workflows (1,8)
TAR workflows represent the practical application of predictive coding technologies and protocols to define approaches to completing predictive coding tasks. Four examples of TAR workflows include:
- TAR 1.0 involves a training phase followed by a review phase, with a control set used to determine the optimal point at which to switch from training to review. The system no longer learns once the training phase is completed. The control set is a random set of documents that have been reviewed and marked as relevant or non-relevant. The control set documents are not used to train the system; they are used to assess the system’s predictions so that training can be terminated when the benefits of additional training no longer outweigh its cost. Training can be done with randomly selected documents, known as Simple Passive Learning (SPL), or with documents chosen by the system to optimize learning efficiency, known as Simple Active Learning (SAL).
- TAR 2.0 uses an approach called Continuous Active Learning® (CAL®), meaning that there is no separation between training and review; the system continues to learn throughout. While many approaches may be used to select documents for review, a significant component of CAL® is many iterations of predicting which documents are most likely to be relevant, reviewing them, and updating the predictions. Unlike TAR 1.0, TAR 2.0 tends to be very efficient even when prevalence is low. Since there is no separation between training and review, TAR 2.0 does not require a control set. Generating a control set can involve reviewing a large number of non-relevant documents (especially when prevalence is low), so avoiding control sets is desirable.
- TAR 3.0 requires a high-quality conceptual clustering algorithm that forms narrowly focused clusters of fixed size in concept space. It applies the TAR 2.0 methodology to just the cluster centers, which ensures that a diverse set of potentially relevant documents is reviewed. Once no more relevant cluster centers can be found, the reviewed cluster centers are used as training documents to make predictions for the full document population. There is no need for a control set; the system is well trained when no additional relevant cluster centers can be found. Analysis of the reviewed cluster centers provides an estimate of the prevalence and of the number of non-relevant documents that would be produced if documents were produced based purely on the predictions, without human review. The user can decide to produce documents (not identified as potentially privileged) without review, similar to SAL from TAR 1.0 (but without a control set), or to review documents that carry too much risk of being non-relevant (a review that can serve as additional training for the system, i.e., CAL®). The key point is that the user has the information needed to decide how to proceed after completing review of the likely-relevant cluster centers, and nothing done before that point is invalidated by the decision. Compare this to starting with TAR 1.0, reviewing a control set, finding that the predictions are not good enough to produce documents without review, and then switching to TAR 2.0, which renders the control set virtually useless.
- TAR 4.0: An approach developed by the e-Discovery Team (Ralph Losey) that includes all types of search methods, with primary reliance placed on predictive coding and the use of high-ranked documents for continuous active training. The approach includes some variations in the ideal workflow, and refinements to the continuous active training to facilitate double-loop feedback. This is called Intelligently Spaced Training (IST), instead of CAL®. It is an essential part of the Hybrid Multimodal IST method.
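The TAR 3.0 step of reviewing only cluster centers and predicting the rest can be sketched as follows. A real TAR 3.0 system relies on conceptual clustering plus a trained classifier; this toy assumes the fixed-size clusters are already formed and simply propagates each center's human judgment to its cluster members, which is enough to show where the prevalence estimate comes from. The function name `tar3_sketch` and the `(center, members)` cluster shape are illustrative assumptions.

```python
def tar3_sketch(clusters, is_relevant):
    # TAR 3.0 sketch: a human reviews only each cluster's center document;
    # those judgments drive predictions for the unreviewed members.
    # clusters: list of (center_doc, [member_docs]) pairs, assumed to come
    # from a conceptual clustering step producing fixed-size clusters.
    predictions = {}
    center_labels = []
    for center, members in clusters:
        label = is_relevant(center)   # the only human review performed
        center_labels.append(label)
        for doc in members:
            predictions[doc] = label  # model prediction, no human review
    # Because the fixed-size cluster centers sample the whole population,
    # the fraction of relevant centers gives a rough prevalence estimate.
    prevalence = sum(center_labels) / len(center_labels)
    return predictions, prevalence
```

The returned prevalence estimate is what lets the user make the decision described above: produce on predictions alone, or continue CAL®-style review of the riskier predictions.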
TAR Uses (9)
TAR technologies, protocols, and workflows can be used effectively to help eDiscovery professionals accomplish many data discovery and legal discovery tasks. Nine commonly considered examples of TAR use include:
- Identification of Relevant Documents
- Early Case Assessment/Investigation
- Prioritization for Review
- Categorization (By Issues, For Confidentiality or Privacy)
- Privilege Review
- Quality Control and Quality Assurance
- Review of Incoming Productions
- Disposition/Trial Preparation
- Information Governance and Data Disposition
(1) Losey, R. e-Discovery Team TAR Course. Available at https://e-discoveryteam.com/tar-course/ [Accessed 16 Feb. 2023].
(2) Grossman, M. and Cormack, G. (2017). Technology-Assisted Review in Electronic Discovery. [ebook] Available at: https://bolch-test.law.duke.edu/wp-content/uploads/2017/07/Panel-1_TECHNOLOGY-ASSISTED-REVIEW-IN-ELECTRONIC-DISCOVERY.pdf [Accessed 16 Feb. 2023].
(3) Grossman, M. and Cormack, G. (2016). Continuous Active Learning for TAR. [ebook] Practical Law. Available at: https://pdfs.semanticscholar.org/ed81/f3e1d35d459c95c7ef60b1ba0b3a202e4400.pdf [Accessed 16 Feb. 2023].
(4) Grossman, M. and Cormack, G. (2016). Scalability of Continuous Active Learning for Reliable High-Recall Text Classification. [ebook] Available at: https://plg.uwaterloo.ca/~gvcormac/scal/cormackgrossman16a.pdf [Accessed 16 Feb. 2023].
(5) Losey, R., Sullivan, J. and Reichenberger, T. (2015). e-Discovery Team at TREC 2015 Total Recall Track. [ebook] Available at: https://trec.nist.gov/pubs/trec24/papers/eDiscoveryTeam-TR.pdf [Accessed 16 Feb. 2023].
(6) Justia Trademarks (2020). CONTINUOUS ACTIVE LEARNING Trademark of Maura Grossman – Registration Number 5876987 – Serial Number 86634255. [online] Available at: https://trademarks.justia.com/866/34/continuous-active-86634255.html [Accessed 16 Feb. 2023].
(7) Justia Trademarks (2020). CAL Trademark of Maura Grossman – Registration Number 5876988 – Serial Number 86634265. [online] Available at: https://trademarks.justia.com/866/34/cal-86634265.html [Accessed 16 Feb. 2023].
(8) Dimm, B. (2016). TAR 3.0 Performance. [online] Clustify Blog – eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development. Available at: https://blog.cluster-text.com/2016/01/28/tar-3-0-performance/ [Accessed 16 Feb. 2023].
(9) Electronic Discovery Reference Model (EDRM) (2019). Technology Assisted Review (TAR) Guidelines. [online] Available at: https://www.edrm.net/wp-content/uploads/2019/02/TAR-Guidelines-Final.pdf [Accessed 16 Feb. 2023].
- e-Discovery Team Training on Technology-Assisted Review (Ralph Losey)
- Describing Technology-Assisted Review (Andrew Haslam’s eDisclosure Systems Buyers Guide)
- The Workstream of eDiscovery – Processes and Tasks (ComplexDiscovery)
ComplexDiscovery Editor’s Note: From time to time, ComplexDiscovery highlights publicly available or privately purchasable announcements, content updates, and research from cyber, data, and legal discovery providers, research organizations, and ComplexDiscovery community members. While ComplexDiscovery regularly highlights this information, it does not assume any responsibility for content assertions.