Sidley Austin details findings of a test of GPT-4’s ability to step in on e-discovery, offering the pros and cons of using the tool for document review.
[EDRM Editor’s Note: EDRM is happy to amplify our Trusted Partners’ news and events. The opinions and positions are those of Sidley Austin. This article was published in the December 13, 2023 issue of Legaltech News from Law.com, a publication of American Lawyer Media. It first appeared in The American Lawyer.]
ChatGPT by OpenAI came crashing into the world on November 30, 2022, and quickly captured everyone’s imagination, including that of businesses and lawyers eager to capitalize on the many ways artificial intelligence (AI) has been predicted to fundamentally change the way business is done, including how law is practiced. In this article, we offer a quantifiable look at whether GPT-4 is likely to live up to these expectations in the legal context and, more specifically, as it relates to document review in e-discovery.
ChatGPT is a generative AI tool built on a large language model, which means it has absorbed a large quantity of written information and can generate new, original content after receiving a prompt from the user. ChatGPT is particularly interesting for legal practitioners because its generative capabilities have the potential to both alter and enhance attorneys’ current practices. Most in the legal community have by now heard the cautionary tale of lawyers who tried unsuccessfully to use ChatGPT to write legal briefs and were subsequently sanctioned by a federal district court. But that mishap certainly does not seal ChatGPT’s fate in the legal field; rather, it is an unfortunate example of the inexperienced use of a new technology without developing an understanding of its strengths and weaknesses.
Indeed, for e-discovery practitioners, ChatGPT and similar generative AI may cause a sea change in the not-so-distant future in how e-discovery work is done. Specifically, ChatGPT’s evaluative and responsive capabilities have the potential to successfully replace or augment functions that are historically performed by attorneys or traditional evaluative tools like technology-assisted review (TAR).
Shortly after the introduction of ChatGPT, OpenAI released a more advanced version of the technology, GPT-4, on March 14, 2023. This newer model is multimodal, which means it can make predictions on both text and images, while the original ChatGPT technology is limited to text only. GPT-4 pushed the boundaries of the technology’s performance even further in a number of ways. Specifically, GPT-4 outperformed ChatGPT across many tasks, exams, and benchmarks. For example, GPT-4 scored in the 90th percentile on the Uniform Bar Exam, while ChatGPT scored in only the 10th percentile. Currently, only the text model of this technology is supported in Microsoft Azure, but the vision-capable version will be released in the near future.
Conducting a GPT-4 Experiment
To better understand the current capabilities of GPT-4 for e-discovery, we collaborated with global legal technology company Relativity to evaluate how standard GPT-4 would perform in coding documents for responsiveness. To perform this experiment, Sidley identified a prior, closed case in which documents had been coded by human reviewers for responsiveness. The now-closed case involved a subpoena related to potential violations of the Anti-Kickback Statute. The subpoena requested documents that were responsive to 19 different document requests, including but not limited to organizational charts, discussions of compliance, physician communications, supplier agreements, and correspondence related to payments to physicians. We provided a representative sample of these documents that reflected the richness of the total corpus of documents. It included 1,500 total documents from the closed case: 500 responsive documents and 1,000 non-responsive documents.
We then provided document review instructions for GPT-4 that mirrored the review instructions employed by the attorneys who had reviewed those documents. GPT-4 evaluated each document individually, based on the review instructions, and reported whether the document was responsive according to a scoring system of negative one to four. GPT-4 was told to give a document a “negative one” if the document could not be processed; a “zero” if the document contained non-responsive junk or no useful information; a “one” if the document contained non-responsive or irrelevant information; a “two” if the document was likely responsive or contained partial or somewhat relevant information; a “three” if the document was responsive; and a “four” if the document was responsive and contained direct and strong evidence described in the responsiveness criteria. The experimental prompt also asked GPT-4 to provide an explanation for why the document was responsive by citing text from the document.
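The scoring scale described above can be sketched in code. The following Python is only an illustration, not the prompt or tooling used in the experiment: the class and function names, the `score` and `explanation` keys, and the cutoff of two for treating a document as responsive are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Score(IntEnum):
    """The negative-one-to-four scale described in the article."""
    UNPROCESSABLE = -1       # document could not be processed
    JUNK = 0                 # non-responsive junk or no useful information
    NOT_RESPONSIVE = 1       # non-responsive or irrelevant information
    LIKELY_RESPONSIVE = 2    # likely responsive, partial or somewhat relevant
    RESPONSIVE = 3           # responsive
    STRONGLY_RESPONSIVE = 4  # responsive with direct and strong evidence


@dataclass
class ReviewResult:
    doc_id: str
    score: Score
    explanation: str  # text the model cited from the document


def parse_result(doc_id: str, raw: dict) -> ReviewResult:
    """Validate one model answer.

    `raw` is a hypothetical dict with keys "score" and "explanation";
    the real output format of the experiment is not described in the article.
    Score(...) raises ValueError for any value outside the scale.
    """
    return ReviewResult(
        doc_id=doc_id,
        score=Score(int(raw["score"])),
        explanation=str(raw.get("explanation", "")),
    )


def is_responsive(result: ReviewResult,
                  cutoff: Score = Score.LIKELY_RESPONSIVE) -> bool:
    """Collapse the scale to a yes/no call. The cutoff is an assumption."""
    return result.score >= cutoff
```

A reviewer who wanted only the clearer calls could pass `cutoff=Score.RESPONSIVE` instead, which would leave the "likely responsive" documents for human follow-up.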
The experiment proceeded in two stages: (1) Sidley provided GPT-4 with the same review instructions given to the attorneys and collected data on GPT-4’s performance relative to the human review; and (2) based on initial output from stage one, the prompt for GPT-4 was modified to address ambiguities in the responsiveness criteria. This modification mirrored a quality control (QC) feedback loop: it gave GPT-4 the same additional information that the contract attorneys had received outside the original review instructions.
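One common way to measure performance relative to human review is to treat the human coding as the reference and compute recall and precision for the model's calls. The sketch below assumes that approach; the article excerpt does not say which metrics were collected, the function name is hypothetical, and the four documents are made-up toy data, not results from the experiment.

```python
def compare_to_human(model_calls: dict, human_calls: dict) -> dict:
    """Compare yes/no responsiveness calls, keyed by document ID.

    Human coding is treated as the reference. Every document in
    `human_calls` is expected to appear in `model_calls`.
    """
    tp = fp = fn = tn = 0
    for doc_id, human_says_responsive in human_calls.items():
        model_says_responsive = model_calls[doc_id]
        if model_says_responsive and human_says_responsive:
            tp += 1  # both call the document responsive
        elif model_says_responsive and not human_says_responsive:
            fp += 1  # model over-calls
        elif human_says_responsive:
            fn += 1  # model misses a responsive document
        else:
            tn += 1  # both call the document non-responsive

    # Guard against division by zero on empty or one-sided samples.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "recall": recall, "precision": precision}


# Toy data for illustration only.
human = {"A": True, "B": True, "C": False, "D": False}
model = {"A": True, "B": False, "C": True, "D": False}
stage_one = compare_to_human(model, human)
print(stage_one["recall"], stage_one["precision"])  # 0.5 0.5
```

Running the same comparison after the stage-two prompt changes would show whether the added guidance moved the model's calls closer to the human coding.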
Read the entire article on the experiment and results obtained here.