[EDRM Editor’s Note: This article was first published November 15, 2024, and EDRM is grateful to Dr. Tristan Jenkinson for permission to republish. The opinions and positions are those of the author.]
Introduction
At the recent Legal 500 Commercial Litigation Conference in London, one of the panels focused on the question “Do the existing Practice Directions support the adoption of Generative AI?”
I wanted to provide a few thoughts of my own on the topic, as there is far more to be discussed than could ever have been covered in a 45-minute panel session… and in fact more than could be included in just this blog post!
The “Obvious” Use Case and PD57AD
One of the features of this question is that it does not specify which uses of Generative AI we are asking whether we can adopt. For those of us working day to day in eDiscovery, perhaps the most obvious use case is using Generative AI to conduct disclosure exercises (directly or indirectly). This might involve identifying the data set(s) for disclosure itself, or assisting in the disclosure process – for example, running document summarisation, with the summaries then used to determine an initial disclosure set.
Where we are talking about the use of Generative AI to assist in the conduct of disclosure within the Business and Property Courts, I believe that the rules already have an answer…
Within Practice Direction 57AD we have paragraph 3.2(3), which states (my emphasis added):
“Legal representatives… are under the following duties to the court… to liaise and cooperate with the legal representatives of the other parties to the proceedings (or the other parties where they do not have legal representatives) so as to promote the reliable, efficient and cost-effective conduct of disclosure, including through the use of technology”
Practice Direction 57AD, paragraph 3.2(3) (emphasis added).
It seems clear that there was some future proofing going on when the Practice Direction was written (as many of us would have hoped for!), as it does not specify or limit what types of technology are valid.
The suggested impact of this is that, if the use of Generative AI is reliable, efficient and cost-effective, then the parties should be promoting its use. On the other hand, if Generative AI were not reliable, efficient or cost-effective, then I think that eDiscovery practitioners would agree that they probably shouldn’t be relying on it.
For the conduct of disclosure, if these three factors can be demonstrated for a particular approach, then all parties should (in theory) promote its use. If all parties are promoting the use of an approach, then it stands to reason that its use should be something all sides can agree to.
Many would argue that this is a gross oversimplification. The opposite should also apply, however – if one party can call into question the reliability, efficiency or cost-effectiveness of the use of Generative AI, then it would likely fall to the party seeking to rely on Generative AI to make their case as to why the usage should be allowed.
Nothing New?
For those of us who have been working in eDiscovery for a while, this may feel like nothing new… The situation is reminiscent of the introduction of predictive coding. While predictive coding is now a commonly used eDiscovery practice, this was not always the case.
While some early adopters were utilising predictive coding on their cases, backed up by witness statements from technical experts regarding its effectiveness, others took a more prudent approach, waiting for relevant case law that could then be relied upon.
In the UK, the first notable case law arrived in the form of “Pyrrho” (Pyrrho Investments Limited v MWB Property Limited). In a judgment dated 16 February 2016, Master Matthews laid out the reasons that predictive coding was approved in that case, which you can read here.
I would argue that the case for Generative AI is more complex than it was for predictive coding. Predictive coding was designed with a specific application in mind; Generative AI has been built with far more generic applicability. While we have a new set of technological tools in the Generative AI toolbox, we still need to develop the specific workflows and approaches that may later become “standard” ways of applying these tools effectively to eDiscovery cases. There is no single or standard pathway for doing this.
It is also worth bearing in mind that there are many different Generative AI technologies, each with its own pros and cons.
These additional complexities are likely to have an impact on how swiftly Generative AI approaches are developed and universally agreed upon – assuming that some consensus is eventually reached. It may be that each different approach developed will have to be independently demonstrated to be reliable, efficient and cost-effective in order to justify its use on live cases.
Workflow Limitations
As noted above, the specific workflows that may well eventually become standard processes are currently being developed. It may be helpful, however, to discuss workflows at a high level, especially with regard to the current limitations on Generative AI usage in eDiscovery matters.
One of the biggest limitations on the use of Generative AI in eDiscovery relates to the size of “Context Windows”. This is essentially the amount of information that such systems can consider at any one time, typically measured by the number of “tokens” in use. Tokens are not quite the same as words: some words are split into multiple tokens, and tokens typically include punctuation. Roughly speaking, 100 tokens amount to approximately 75 words.
The version of ChatGPT first released to the public in November 2022, to much acclaim, was GPT-3.5, which could handle 4,096 tokens (roughly 3,000 words). The latest version of ChatGPT is GPT o1, which can handle 128,000 tokens, so approximately 96,000 words. The latest version of Anthropic’s Claude model (3.5) can handle 200,000 tokens (around 150,000 words).
While these token limits are expected to continue to rise, they would not currently be enough to consider the full content of even a relatively modest eDiscovery matter.
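Purely for illustration, here is a minimal Python sketch of that token arithmetic. The 100-tokens-to-75-words conversion and the context window sizes are the approximate figures quoted above; the 500,000-word “bundle” is a made-up example, not a real matter.

```python
# Rough token arithmetic, using the approximate figures quoted above.
WORDS_PER_100_TOKENS = 75  # i.e. roughly 0.75 words per token

CONTEXT_WINDOWS = {  # approximate token limits mentioned in the text
    "GPT-3.5": 4_096,
    "o1": 128_000,
    "Claude 3.5": 200_000,
}


def approx_tokens(word_count: int) -> int:
    """Approximate token count from a word count (100 tokens ~ 75 words)."""
    return round(word_count * 100 / WORDS_PER_100_TOKENS)


# A made-up 500,000-word document bundle: larger than any of the context
# windows above, so it could not be considered in a single pass.
bundle_tokens = approx_tokens(500_000)
for model, window in CONTEXT_WINDOWS.items():
    fits = "fits" if bundle_tokens <= window else "does not fit"
    print(f"{model}: ~{bundle_tokens:,} tokens vs {window:,} window -> {fits}")
```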
This means that, currently, our workflows have to be designed with this limitation in mind. There is also a conflict with another limitation – cost.
The cost of using Generative AI is typically linked to the number of tokens you are asking it to process, as well as the size of the output (again measured in tokens). This means that each time you run a prompt you would typically be charged by the model for both the input and the output (though those costs may be combined).
As an example, at the time of writing, based on https://docs.anthropic.com/en/docs/about-claude/models, the cost of using the most advanced Claude 3.5 model would be $3 for each million tokens of input and $15 for each million tokens of output.
Forgetting for a moment the limitations on the number of tokens that the model can cope with, let’s consider the potential cost over a modest data set.
Purely as an example, let’s say you have 100,000 files, each of which is 2,250 words.
- Each file therefore would be around 3,000 tokens.
- So with 100,000 files you have around 300 million tokens of input.
- Just the input cost of a single prompt run over the full set of documents would be 300 million tokens x $3 per million = $900… and this excludes the tokens that would be required in the prompt to tell the system what you want it to do with the data.
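To make the arithmetic explicit, here is a minimal Python sketch reproducing the worked example above. The figures are assumptions taken from this post – roughly 75 words per 100 tokens and the published $3 per million input tokens rate – and the calculation covers the input side of a single prompt run only.

```python
# Hypothetical cost estimate reproducing the worked example above.
WORDS_PER_TOKEN = 0.75                 # ~100 tokens per 75 words
INPUT_COST_PER_MILLION_TOKENS = 3.00   # USD, input side only (Claude 3.5 rate)


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


num_files = 100_000
words_per_file = 2_250

tokens_per_file = estimate_tokens(words_per_file)    # ~3,000 tokens
total_input_tokens = num_files * tokens_per_file     # ~300 million tokens
input_cost = total_input_tokens / 1_000_000 * INPUT_COST_PER_MILLION_TOKENS

print(f"Tokens per file:    {tokens_per_file:,}")
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Input cost (USD):   ${input_cost:,.2f}")      # ~$900, excluding the prompt itself
```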
These are two key considerations that currently have to be borne in mind when developing Generative AI workflows in eDiscovery.
Workflow Methodologies
That is not to say that there are not ways to work around such limitations. Several approaches have been developed to do this.
Some approaches work by first identifying a small set of data (using keyword searching/predictive coding etc), which is then analysed using the Generative AI model. This initial selection therefore limits the amount of content (and thus the number of tokens of input) that the model has to process.
It may be possible to extend this approach using batching, so that a larger number of documents could be considered – though this obviously has a cost implication.
Another approach that has been developed is to start by limiting the dataset (as above). However, rather than using a single model to consider the content of the reduced dataset, an initial (cheaper) model is used to create summaries of each document, designed to pull out key content. Those summaries are then passed to a more detailed (and expensive) model, which performs the analysis, reports, and identifies content of interest. This approach has found considerable use in ECA and investigation-type cases.
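For illustration, here is a minimal Python sketch of that two-stage summarise-then-analyse workflow. The functions summarise_with_cheap_model and analyse_with_detailed_model are hypothetical placeholders standing in for calls to a cheaper summarisation model and a more capable (and more expensive) analysis model; batching, prompts and document loading are all heavily simplified.

```python
from typing import Dict, List


def summarise_with_cheap_model(text: str) -> str:
    """Placeholder: summarise one document with a low-cost model."""
    raise NotImplementedError("Swap in a real model call here.")


def analyse_with_detailed_model(summaries: List[str], question: str) -> str:
    """Placeholder: analyse a batch of summaries with a more capable model."""
    raise NotImplementedError("Swap in a real model call here.")


def two_stage_review(documents: Dict[str, str], question: str,
                     batch_size: int = 50) -> List[str]:
    # Stage 1: cheap per-document summaries, keeping token usage per call low.
    summaries = {doc_id: summarise_with_cheap_model(text)
                 for doc_id, text in documents.items()}

    # Stage 2: feed the summaries, in batches, to the more detailed model,
    # asking it to identify documents of potential interest.
    findings = []
    ids = list(summaries)
    for start in range(0, len(ids), batch_size):
        batch = [summaries[i] for i in ids[start:start + batch_size]]
        findings.append(analyse_with_detailed_model(batch, question))
    return findings
```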
Now that we have discussed (albeit at a very high level) some of the workflow approaches, let’s look at some of the criteria mentioned in PD57AD.
Reliability
One of the key points that I try to make regularly is that Generative AI is a tool. Like any tool, if you want to use it, you should seek to understand any limitations or potential issues that could arise from using that tool.
Despite all the capabilities of Generative AI, there are some apparently simple things that it fails to do, as well as some substantial potential issues that could impact on “reliability”.
While focused on ChatGPT rather than Generative AI in general, the “Categorical Archive of ChatGPT Failures” by Ali Borji (https://arxiv.org/pdf/2302.03494) is an enlightening read. It discusses many of the areas where ChatGPT has historically had problems. Written in April 2023, the article is now slightly dated, and misses some of the more recent reported failures. It is still an intriguing read for those interested in what Generative AI cannot do, or has struggled with.
Amongst the more recently uncovered problems that Generative AI struggles with is the strawberry issue. This is perhaps a perfect example of Generative AI failing to do something simple.
The Strawberry Issue
This was a problem that was discovered initially with the 4o version of ChatGPT (which was released in May 2024). The issue was reported in June 2024 (covered in this bug report https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618).
The image included in the bug report demonstrates the issue – rather surprisingly, ChatGPT is unable to correctly count the number of ‘r’s in the word ‘strawberry’.
When I was testing this in September 2024 as part of a separate presentation, I found that the original issue had been fixed – GPT version 4o was now correctly reporting that strawberry contains three ‘r’s, as shown below:
However, it appears that while this specific instance may be resolved, the underlying problem remains. If you follow up the strawberry query with the same question based on raspberry, you get the following:
This may be a case where the underlying problem cannot be fixed without retraining the entire model. Because of the somewhat black-box nature of ChatGPT, it could be that simple rules have been put in place specifically to address users asking about strawberries. This could be seen as a further risk to reliability, since reported issues may appear to have been fixed while the resolutions are only superficial.
Date Cut-Offs
One of the other key topics that I cover when talking about potential issues to be aware of when using Generative AI is date cut-offs. Because the models are trained on large data sets, after the training occurs the model has limited knowledge of new events and information. While some newer models do incorporate an element of live searching, these features typically draw on considerably less data than the content within the trained models.
This is potentially an area to consider when it comes to reliability, though it will depend on the use case and whether that use case relies on information from a date-restricted training set. For example, if the system is used to identify all documents which talk about a specific topic, it might be inferred that this would not rely on knowledge beyond the training cut-off. However, if the model is used to identify documents which could be relevant to violations of a new law or directive, then it may need to draw on its training data for details of that law or directive, in which case a date-restricted training set could become an issue.
Hallucinations
This is a topic covered in more detail in a previous article. There is an inherent issue with Generative AI in that it can invent “facts” which it will present to you as being true, but which are really fabrications. These are known as hallucinations. They are often very convincing and can be difficult to identify as false.
The impact of hallucinations is most obvious when Generative AI is used to create content or responses, since the system can insert material that it has invented. It may be less obvious how hallucinations can have an impact away from content creation – for example, why they may be an issue when using Generative AI to identify relevant data.
One way that the issue can manifest is where AI systems hallucinate reasons to flag content as being relevant. This is something highlighted by John Tredennick and William Webber in their article “Will ChatGPT Replace Ediscovery Review Teams?”, where they discuss several examples where the reasoning provided for tagging a document relevant contained hallucinated details.
On a closely related topic, OpenAI (the company behind ChatGPT) have a tool called Whisper which is used in audio transcription. While it does generate content (i.e. text transcription), it should only be recognising the content of audio and transcribing it. Unfortunately, Whisper, and other applications based on it, have been shown to be susceptible to hallucinations – making up comments in the transcription that have never been made on the recording.
This is all the more concerning given the push to use such applications in the medical industry, and when you consider some of the statistics involved. Based on reporting from the Associated Press, a researcher at the University of Michigan found hallucinations present in 80% of the samples he inspected. Such hallucinations could have a serious impact. Based on the same Associated Press article, researchers from Cornell University found that:
“[N]early 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.”
Associated Press, “Researchers say AI transcription tool used in hospitals invents things no one ever said,” October 26, 2024.
Hallucinations can therefore be a significant concern when looking at the reliability of Generative AI systems which could be implemented in the conduct of disclosure in an eDiscovery case.
Minimising Hallucinations
There is a considerable amount of research relating to minimising hallucinations. Some approaches have already been developed to try and mitigate their impact.
One of these approaches is using RAG (Retrieval Augmented Generation). This is where the model is linked to real world data, which is then used to “ground” the results. The idea behind the concept is that if the system is prompted or programmed to rely upon (and link to) information from those real world documents, it should minimise the possibility of incorrect information being fabricated. This approach is not fool-proof, however, as the research from Tredennick and Webber, which used this approach, demonstrates.
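To illustrate the grounding idea, here is a minimal sketch of a RAG-style call. The functions retrieve_relevant_passages and call_model are hypothetical placeholders (a search over the case documents and a Generative AI API call respectively); the point is simply that the prompt instructs the model to answer only from the retrieved passages and to cite them, rather than drawing on its training data.

```python
from typing import List


def retrieve_relevant_passages(query: str, top_k: int = 5) -> List[str]:
    """Placeholder: fetch the top-k passages from the case documents."""
    raise NotImplementedError("Swap in a real retrieval step here.")


def call_model(prompt: str) -> str:
    """Placeholder: send the prompt to a Generative AI model."""
    raise NotImplementedError("Swap in a real model call here.")


def grounded_answer(question: str) -> str:
    passages = retrieve_relevant_passages(question)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the numbered passages below. "
        "Cite the passage numbers you rely on. If the passages do not "
        "contain the answer, say so rather than guessing.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)
```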
Recent research from the University of Oxford, published in Nature (a summary can be found here), suggests a method that could be used to identify instances where a model is most likely to hallucinate.
As the summary article details, the approach is focused on identifying specific queries which are likely to generate hallucinated responses. Essentially, it involves sampling a number of potential responses and measuring their “semantic entropy” – that is, how variable the meanings of those responses are. The concept behind this is that if the sampled responses contain few different meanings (low semantic entropy), the answer is less likely to be a hallucination; if they contain a large number of different meanings (high semantic entropy), then a hallucination is more likely.
The use of this methodology could lead to improvements in prompt engineering – improving how we ask Generative AI for responses so as to minimise hallucinations. However, it comes at a cost. For each prompt that you want to check for the likelihood of hallucinations, a number of sample responses have to be generated. Because each prompt needs to be run numerous times to produce a statistically meaningful sample, this can significantly increase the cost of using the Generative AI system. While costs are likely to come down in the future, at present such an approach may struggle when it comes to an analysis of cost-effectiveness.
An alternative or additional approach may be to use the pool of responses to create an improved response. A methodology could perhaps be developed that takes the pool of responses and considers information only from the most frequent semantically similar responses, generating a form of amalgamated response. The theory is that hallucinations should occur rarely, so if many responses share similar semantic content, that frequently reported content is more likely to be correct. Again, as this would require multiple prompts to be run, there is a potentially significant impact on the likely costs of such an approach.
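As a rough illustration of both ideas, here is a minimal Python sketch that samples several responses, groups them by meaning, estimates a simple entropy over the groups, and returns the most common meaning as an amalgamated answer. The functions call_model and same_meaning are hypothetical placeholders (the latter standing in for the kind of semantic-equivalence check used in the Oxford research); this is a sketch of the concept, not the published method.

```python
import math
from typing import List, Tuple


def call_model(prompt: str) -> str:
    """Placeholder: one sampled response from a Generative AI model."""
    raise NotImplementedError("Swap in a real model call here.")


def same_meaning(a: str, b: str) -> bool:
    """Placeholder: do two responses express the same meaning?"""
    raise NotImplementedError("Swap in a semantic-equivalence check here.")


def sample_and_score(prompt: str, n_samples: int = 10) -> Tuple[str, float]:
    responses = [call_model(prompt) for _ in range(n_samples)]

    # Group responses into clusters of semantically equivalent answers.
    clusters: List[List[str]] = []
    for r in responses:
        for cluster in clusters:
            if same_meaning(r, cluster[0]):
                cluster.append(r)
                break
        else:
            clusters.append([r])

    # Entropy over the cluster distribution: low entropy = consistent meaning,
    # high entropy = many different meanings, so a hallucination is more likely.
    probs = [len(c) / n_samples for c in clusters]
    entropy = -sum(p * math.log(p) for p in probs)

    # Return the modal meaning (largest cluster) alongside the entropy score.
    best = max(clusters, key=len)[0]
    return best, entropy
```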
Biases and Inaccuracies
This is another topic that I have covered in other online articles, and not one that I will go into further today, other than to state the (somewhat obvious) point that if the underlying data sets on which Generative AI is trained contain biases or inaccuracies, then the responses that you get may display the same features.
It is potentially an area that could be used in an attempt to undermine the reliability of a model.
Restrictions in Workflows
As discussed above, there are some restrictions in Generative AI systems that our workflows currently have to allow for. I briefly touched on two of the approaches, and they are worth considering from a reliability standpoint.
If you were to use Generative AI on a reduced data set to make decisions on (or advise on) documents for disclosure, then you would need to show that whatever process was used to generate that initial data set is also reliable. Otherwise the whole process could be questioned.
In addition, a workflow that generates summaries and then analyses those summaries could easily face challenges if used for disclosure. This would be on the basis that a 100-page document could contain a single line that is the smoking gun. There is a significant risk that the smoking-gun line would not be pulled out or included in the summary, which is all that the Generative AI system then considers.
While disclosure in the Courts of England and Wales is not designed to be a scorched earth approach and proportionality can be applied, relying on the analysis of summaries is an area which I think is likely to face resistance, especially in high stakes, high value litigation or investigations. The approach was designed for (and appears to work well in) investigation and ECA type matters to quickly identify relevant material, but may not be the best approach for disclosure.
With this in mind, it may well be that the workflows designed in the future (if not already being built or put in place) will be a fluid mix of these approaches, together with predictive coding workflows. For example, the analysis of summaries may be used to find an initial tranche of relevant information that can then be used to train a predictive coding model run across the full documents.
Efficiency
The efficiency of Generative AI usage will likely depend very much on the use case. While there are definitely efficiency gains to be made through the use of Generative AI, that does not necessarily mean that all possible uses of it will be efficient.
The efficiency question raised by the Practice Direction wording may well also come down to “efficient compared to what?”.
Consider a simple potential use case – using Generative AI to perform a first-pass review for relevance, the results of which are then given to subject matter experts to review for a final decision on what to disclose.
The above approach would (in all likelihood) be significantly more efficient than a single subject matter expert reviewing all files and deciding what to disclose. However, would it be more efficient than the same subject matter expert performing a predictive coding based review? The review set for the predictive coding review may well be similar to the set highlighted for review by Generative AI (given that predictive coding is already accepted as a reliable approach). In that case there could be only a small efficiency gain (if any).
As discussed above, there are likely to be many different workflows developed which will utilise different types of Generative AI systems to directly impact disclosure. It may be that each style of workflow will need to separately demonstrate its efficiency before being generally accepted.
Before the courts reach that level of general acceptance, it is likely that, as with predictive coding discussed above, some early adopters will put forward workflows supported by expert statements, with their use agreed with the other side.
Cost-Effectiveness
For similar reasons, gauging whether Generative AI workflows will be cost-effective will depend largely on the details of the workflow itself.
While the technology may lend itself to efficiencies (especially over manual review), that does not necessarily mean cost-effectiveness will be clear cut. As discussed above, some of the methodologies that could be implemented for limiting (or identifying the likelihood of) hallucinations require a potentially substantial number of additional Generative AI prompts to be run, which could increase the cost. At present, these approaches would likely not be cost-effective due to the cost impacts discussed earlier.
As with many new technologies, Generative AI systems are not necessarily cheap. In theory, prices would be expected to fall over time as processing power (as well as token limits) increases, and these technologies should therefore become more and more cost-effective as time goes by. It is highly likely that additional workflows will then become possible to implement where they simply are not possible at present. Similarly, workflows not yet conceived of could be developed and demonstrated to be effective once the impact of token limits and cost is minimised.
The price of Generative AI is a point well made in a law.com article from Isha Marathe which, while written a year ago, still represents the concerns of many. For example, the article quotes Matt Jackson of Sidley Austin as stating that “[G]enerative AI comes with sticker shock”. This initial high pricing, with costs expected to fall later, is again not too different from TAR, which the article also references, noting that the “TAR tax” is now being replaced by the “Generative AI tax”.
Another complication with the cost-effectiveness of Generative AI is that, as discussed above, its costs are often based on usage, which may not be easy to estimate before starting a project. While such uncertainties are certainly not new to the world of eDiscovery, they do add further cost uncertainty to projects.
Isha Marathe’s article mentioned above finishes on an insightful note from Quinn Emanuel’s Melissa Dalziel, highlighting that, at least when the article was published, we did not yet have the metrics to know the cost implications of Generative AI eDiscovery workflows. Even a year on, there is not a large amount of published information about different workflows and their potential cost impacts.
Coming up in Part 2
In part two, I will delve into some of the other use cases, and what we are seeing with Generative AI in the courts.
Read the original article here.
Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.