2. Types of AI Used in eDiscovery

a. Clustering

Clustering is an example of unsupervised machine learning.  Its purpose is to group “similar” items together so that users can recognize the characteristics or topics that make them similar.  This lets users learn something about the composition of the data set or take action on a whole group of similar items (i.e., documents) at once.  Clustering is unsupervised in the sense that users do not control the dimension(s) along which “similarity” is defined and do not have to label examples of items in each cluster in order to train the system.  The designer of the system, however, generally does have to specify the features along which item similarity is to be measured and how many clusters there should be.
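
As a rough illustration of the designer's two choices (the features and the number of clusters), the following minimal sketch groups short texts with a simple k-means-style loop.  The word-count features, the cosine similarity measure, the seeding with the first k documents, and the sample texts are all assumptions made for illustration; commercial tools use more sophisticated methods.

```python
import math
from collections import Counter

def vectorize(text):
    """Feature choice (made by the designer): lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, k, rounds=10):
    """Group docs into k clusters; assumes k <= len(docs). No labels are supplied."""
    vectors = [vectorize(d) for d in docs]
    centroids = vectors[:k]  # simplification: seed with the first k documents
    labels = []
    for _ in range(rounds):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the summed word counts of its members.
        for c in range(k):
            summed = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    summed.update(v)
            if summed:
                centroids[c] = summed
    return labels

docs = [
    "merger agreement draft for review",
    "lunch order pizza for friday",
    "revised merger agreement attached",
    "pizza lunch friday reminder",
]
labels = cluster(docs, k=2)
print(labels)  # documents sharing a label were grouped together
```

Note that no document was labeled by a person; the groups emerge only from the chosen features and the chosen number of clusters.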

b. Email Threading

Most emails do not occur as single items but rather are part of an ongoing conversation.  When one replies to an email, it is typical to include the original email as part of the reply. Email threading works to identify all of the emails in the same conversation so they do not have to be reviewed separately.  Email threading reduces work and improves consistency by grouping the emails of a conversation together in a single unit.
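
One common way to approximate threading is to follow each message's reply link back to the first message in the conversation.  The sketch below assumes each email has already been reduced to a dictionary with hypothetical "id" and "in_reply_to" fields (loosely modeled on the Message-ID and In-Reply-To email headers); real threading tools typically also compare quoted content, which is not shown here.

```python
from collections import defaultdict

def thread_root(msg_id, parent_of):
    """Follow reply links upward until reaching a message with no known parent."""
    seen = set()
    while parent_of.get(msg_id) is not None and msg_id not in seen:
        seen.add(msg_id)  # guard against malformed, circular reply links
        msg_id = parent_of[msg_id]
    return msg_id

def group_threads(emails):
    """Map each conversation's root message id to the ids of all emails in that conversation."""
    parent_of = {e["id"]: e["in_reply_to"] for e in emails}
    threads = defaultdict(list)
    for e in emails:
        threads[thread_root(e["id"], parent_of)].append(e["id"])
    return dict(threads)

emails = [
    {"id": "m1", "in_reply_to": None},   # original message
    {"id": "m2", "in_reply_to": "m1"},   # reply
    {"id": "m3", "in_reply_to": "m2"},   # reply to the reply
    {"id": "m4", "in_reply_to": None},   # unrelated message
]
threads = group_threads(emails)
print(threads)
```

Each resulting group can then be presented to a reviewer as a single unit.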

c. Concept Search

When using keywords to search for documents, searchers are often challenged by the difficulty of guessing the exact words that are used in those documents.  One source of difficulty is that a word can have multiple meanings, depending on its context.  For example, “strike” in the context of “bat” and “ball” has a different meaning than “strike” in the context of “management” and “labor,” or of “smack” and “hit.”  Similarly, “court” could be related to “judge” or to “tennis.”  This ambiguity is called “polysemy.”  Conversely, different words can have the same meaning, for example, “doctor” and “physician.”  These words are called “synonyms.”  In addition to synonyms, words can have related meanings.  For example, if a document contains words like “lawyer,” “attorney,” or “judge,” then that document is likely to be about something legal.  Any one of those words could be sufficient to identify a legal topic in a document.  Concept search is another unsupervised machine learning method in which the machine learns the context in which words are used and mathematically models the relationships among words.  Users can then search by meaning rather than by individual terms.  For example, a search for “cups” could also bring back documents about “mugs” and “glasses.”
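
The idea of modeling word relationships from context can be sketched very simply: treat two words as related when they tend to appear alongside the same other words.  The code below is a toy stand-in, not the method of any particular product; the stopword list, the 0.5 similarity threshold, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative list only

def tokens(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def build_contexts(docs):
    """For each word, count the other words that appear in the same document."""
    contexts = defaultdict(Counter)
    for doc in docs:
        words = set(tokens(doc))
        for w in words:
            contexts[w].update(words - {w})
    return contexts

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def concept_search(query, docs, contexts, threshold=0.5):
    """Return indexes of documents containing the query word or a word used in similar contexts."""
    query = query.lower()
    query_context = contexts.get(query, Counter())
    related = {query} | {w for w in contexts if cosine(query_context, contexts[w]) >= threshold}
    return [i for i, doc in enumerate(docs) if related & set(tokens(doc))]

docs = [
    "The doctor examined the patient",
    "The physician examined the patient",
    "The judge entered the order",
]
contexts = build_contexts(docs)
hits = concept_search("physician", docs, contexts)
print(hits)  # includes the "doctor" document, which never uses the word "physician"
```

Because “doctor” and “physician” occur in the same contexts in this small collection, a search for one also retrieves documents that use the other.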

d. Technology-Assisted Review (“TAR”)

Supervised machine learning is used extensively in the eDiscovery process in the U.S. and abroad to rank or classify electronically stored information (“ESI”) to identify documents to produce, with “black letter law” supporting its use.[1]  TAR, or predictive coding, is widely available in eDiscovery software products on the market today.  While there is considerable variability in workflow and particulars, the core idea is that a computer learns to distinguish relevant from non-relevant documents based on the coding of human reviewers and can then classify unlabeled documents on its own.  This technology is now well established.  Many, but not all, TAR algorithms are “language agnostic” such that they can rank or classify documents regardless of language.
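
The core idea (learn from reviewer coding, then rank unreviewed documents) can be sketched with a small naive Bayes classifier.  This is only one of many possible learning algorithms and is not presented as what any TAR product uses; the training examples, labels, and function names are hypothetical.

```python
import math
from collections import Counter

def train(coded_docs):
    """coded_docs: list of (text, is_relevant) pairs supplied by human reviewers."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in coded_docs:
        word_counts[is_relevant].update(text.lower().split())
        doc_counts[is_relevant] += 1
    return word_counts, doc_counts

def relevance_score(text, model):
    """Log-odds that a document is relevant (naive Bayes with add-one smoothing)."""
    word_counts, doc_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_rel = sum(word_counts[True].values()) + len(vocab)
    total_non = sum(word_counts[False].values()) + len(vocab)
    score = math.log(doc_counts[True] / doc_counts[False])
    for w in text.lower().split():
        if w not in vocab:
            continue  # words never seen in training carry no evidence
        p_rel = (word_counts[True][w] + 1) / total_rel
        p_non = (word_counts[False][w] + 1) / total_non
        score += math.log(p_rel / p_non)
    return score

coded = [
    ("pricing agreement with competitor", True),
    ("competitor pricing discussed at meeting", True),
    ("office holiday party schedule", False),
    ("parking garage schedule update", False),
]
model = train(coded)

unreviewed = ["holiday parking update", "meeting about competitor pricing"]
ranked = sorted(unreviewed, key=lambda d: relevance_score(d, model), reverse=True)
print(ranked)  # most likely relevant first
```

A ranking like this lets reviewers look at the likeliest relevant documents first; a cutoff score would turn the ranking into a classification.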

e. Entity Recognition

Entity recognition is a supervised machine learning process where the computer learns to identify entities such as names of people, places, or companies, dollar amounts, job titles, account numbers, case/matter numbers, or other things.  Entities in a text are words or patterned numbers (e.g., XXX-XX-XXXX), but they often have properties of interest that go beyond the individual word or number.  For example, one may want to search for all documents that contain a credit card number or other personally identifiable information (“PII”).  A plain search for numbers can bring back too many documents that contain numbers that are not credit card numbers, so entity recognition can be used to discern particular numerical patterns.
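
The difference between "any number" and "a credit card number" can be shown with a pattern-based sketch.  Note that this example uses hand-written patterns plus the Luhn checksum (a standard check-digit test for payment card numbers) rather than a learned model, so it illustrates only the pattern-discerning part of entity recognition; the pattern definitions, entity labels, and sample text are assumptions.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# The XXX-XX-XXXX shape mentioned above.
SSN_SHAPED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_entities(text):
    """Return (label, matched_text) pairs for patterned numbers found in the text."""
    entities = [("SSN_SHAPED", m.group()) for m in SSN_SHAPED.finditer(text)]
    for m in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_valid(digits):  # discard long numbers that are not plausible card numbers
            entities.append(("CARD_NUMBER", m.group()))
    return entities

text = ("Invoice 1234 5678 9012 3456 was paid with card 4111 1111 1111 1111; "
        "the form lists 123-45-6789.")
entities = find_entities(text)
print(entities)
```

The 16-digit invoice number has the same shape as a card number but fails the checksum, so only the second number is reported as a card number.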

f. Sentiment Analysis

Sentiment analysis is used to identify the emotional content of data, which can include excitement, anger, or other positive or negative emotions.  For example, a sentence like “I hated that movie” would be classified as having a negative sentiment.  On the other hand, “I found a marvelous pair of shoes” would be classified as having a positive sentiment.
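
A very simple way to approximate this is to count words from lists of positive and negative terms.  The sketch below uses tiny, hypothetical word lists chosen only to cover the two example sentences; production tools generally rely on trained models and far larger vocabularies.

```python
# Illustrative word lists only; not drawn from any real sentiment lexicon.
POSITIVE = {"marvelous", "great", "love", "loved", "excellent"}
NEGATIVE = {"hated", "hate", "terrible", "awful", "angry"}

def sentiment(sentence):
    """Classify a sentence as positive, negative, or neutral by counting lexicon words."""
    words = [w.strip(".,!?;:\"'“”").lower() for w in sentence.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I hated that movie."))
print(sentiment("I found a marvelous pair of shoes."))
```

A word-counting approach like this misses negation and sarcasm (for example, "not marvelous"), which is one reason real systems learn from labeled examples instead.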

g. Machine Translation

It is becoming increasingly common to encounter foreign-language documents during discovery.  This, coupled with the overall growth in data volume, can make the costly and time-consuming process of human translation particularly burdensome.  When much of the translated content is irrelevant, the return on investment can seem quite low.  AI tools can be used to reduce, though not eliminate, the need for human translation. As an initial step, these tools can quickly identify those documents that contain foreign language text and list the languages they contain, allowing for more comprehensive planning.  Some tools can actually translate the text from one language to another.  Such machine-translation tools have improved to a considerable degree in the past few years.
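
The initial language-identification step can be illustrated by counting very common words from each language.  This is a minimal sketch under stated assumptions: the four short word lists and the two-hit minimum are hypothetical choices, and real tools use statistical models covering many more languages.  Translation itself is not shown.

```python
import re

# Small, illustrative lists of very common words per language (ISO 639-1 codes).
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "es": {"el", "la", "de", "que", "y", "en", "los"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "zu"},
    "fr": {"le", "la", "et", "les", "des", "est", "une"},
}

def detect_languages(text, min_hits=2):
    """List the languages whose common words appear at least min_hits times in the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, including accented ones
    hits = {lang: sum(w in vocab for w in words) for lang, vocab in COMMON_WORDS.items()}
    return sorted(lang for lang, count in hits.items() if count >= min_hits)

mixed = detect_languages("The contract is attached. El contrato está en la oficina.")
german = detect_languages("Der Vertrag ist nicht unterschrieben.")
print(mixed, german)
```

A per-document language list of this kind is what allows a team to estimate, before any translation is commissioned, how much material exists in each language.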

h. Anonymization and Identity Masking

AI tools are used to improve the speed and accuracy of the anonymization and identity masking of personal, confidential, and/or privileged information contained in electronic records.  Some programs are able to systematically identify PII that then can be automatically redacted or “pseudonymized” (replaced with other information) to protect confidentiality.  For example, a tool can identify all instances of a person’s name and replace each one with “Jane Doe 1.”  This can help in complying with, for instance, the requirements of the European General Data Protection Regulation (“GDPR”).  AI redaction tools can also copy redactions across duplicate documents, thereby ensuring consistency.
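
The replacement step of pseudonymization can be sketched as follows, assuming the names to be masked have already been identified (for example, by an entity recognition tool).  The function name, the sample names, and the numbering scheme are hypothetical; the sketch matches exact spellings only and would miss nicknames or misspellings.

```python
import re

def pseudonymize(text, names):
    """Replace each listed name with a consistent alias ("Jane Doe 1", "Jane Doe 2", ...)."""
    if not names:
        return text, {}
    aliases = {name: f"Jane Doe {i}" for i, name in enumerate(names, start=1)}
    # Try longer names first so a full name is not partially replaced by a shorter one.
    longest_first = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in longest_first) + r")\b")
    masked = pattern.sub(lambda m: aliases[m.group()], text)
    return masked, aliases

text = "Maria Lopez emailed Sam Reed. Sam Reed replied to Maria Lopez."
masked, aliases = pseudonymize(text, ["Maria Lopez", "Sam Reed"])
print(masked)
```

Because the same alias is used for every occurrence of a name, readers can still follow who did what without learning anyone's identity, and the alias table can be stored separately under restricted access.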


[1] Rio Tinto PLC v. Vale S.A., 306 F.R.D. 125 (S.D.N.Y. 2015).