Replacing Attorney Review? Sidley’s Experimental Assessment of GPT-4’s Performance in Document Review

Image: Kaylee Walstad, EDRM with AI.

Sidley Austin details findings of a test of GPT-4’s ability to step in on e-discovery, offering the pros and cons of using the tool for document review.

[EDRM Editor’s Note: EDRM is happy to amplify our Trusted Partners’ news and events. The opinions and positions are those of Sidley Austin. This article was published in the December 13, 2023 issue of Legaltech News from Law.com, a publication of American Lawyer Media. It first appeared in The American Lawyer.]

ChatGPT by OpenAI came crashing into the world on November 30, 2022, and quickly captured everyone’s imagination, including that of businesses and lawyers eager to capitalize on the many ways artificial intelligence (AI) has been predicted to fundamentally change the way business is done, including how law is practiced. In this article, we offer a quantifiable look at whether GPT-4 is likely to live up to these expectations in the legal context and, more specifically, as it relates to document review in e-discovery.

ChatGPT is a generative AI model built on a large language model, which means it can absorb a large quantity of written information and then generate new, original content in response to a prompt from the user. ChatGPT is particularly interesting for legal practitioners because its generative capabilities have the potential to both alter and enhance attorneys’ current practices. For example, most in the legal community have by now heard the cautionary tale of lawyers who tried unsuccessfully to use ChatGPT to write legal briefs and were subsequently sanctioned by a federal district court. But that mishap certainly does not seal ChatGPT’s fate in the legal field; rather, it is an unfortunate example of inexperienced use of a new technology without an understanding of its strengths and weaknesses.

Indeed, for e-discovery practitioners, ChatGPT and similar generative AI may cause a sea change in the not-so-distant future in how e-discovery work is done. Specifically, ChatGPT’s evaluative and responsive capabilities have the potential to successfully replace or augment functions historically performed by attorneys or by traditional evaluative tools like technology assisted review (TAR).


Shortly after the introduction of ChatGPT, OpenAI released a more advanced version of the technology, GPT-4, on March 14, 2023. This newer model is multimodal, which means it can make predictions on both text and images, while the original ChatGPT technology is limited to text only. GPT-4 pushed the boundaries of the technology’s performance even further in a number of ways, outperforming ChatGPT across many tasks, exams, and benchmarks. For example, GPT-4 scored in the 90th percentile on the Uniform Bar Exam, while ChatGPT scored in only the 10th percentile. Currently, only the text model of this technology is supported in Microsoft Azure, but the vision-capable version will be released in the near future.

Conducting a GPT-4 Experiment

To better understand the current capabilities of GPT-4 for e-discovery, we collaborated with global legal technology company Relativity to evaluate how standard GPT-4 would perform in coding documents for responsiveness. To perform this experiment, Sidley identified a prior, closed case in which documents had been coded by human reviewers for responsiveness. The now-closed case involved a subpoena related to potential violations of the Anti-Kickback Statute. The subpoena requested documents that were responsive to 19 different document requests, including but not limited to organizational charts, discussions of compliance, physician communications, supplier agreements and correspondence related to payments to physicians. We provided a representative sample of these documents that reflected the richness of the total corpus: 1,500 documents in all from the closed case, of which 500 were responsive and 1,000 non-responsive.
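
As a quick illustration, the richness of the sample (the share of responsive documents) follows directly from the counts stated above; the short arithmetic sketch below uses only figures from the article.

```python
# Richness of the 1,500-document test sample described above:
# the proportion of documents coded responsive by the human reviewers.
responsive = 500
non_responsive = 1_000
total = responsive + non_responsive
richness = responsive / total

print(f"{total} documents, richness = {richness:.1%}")
# prints "1500 documents, richness = 33.3%"
```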


We then provided document review instructions for GPT-4 that mirrored the review instructions employed by the attorneys who had reviewed those documents. GPT-4 evaluated each document individually, based on the review instructions, and reported whether the document was responsive according to a scoring system of negative one to four. GPT-4 was told to give a document a “negative one” if the document could not be processed; a “zero” if the document contained non-responsive junk or no useful information; a “one” if the document contained non-responsive or irrelevant information; a “two” if the document was likely responsive or contained partial or somewhat relevant information; a “three” if the document was responsive; and a “four” if the document was responsive and contained direct and strong evidence described in the responsiveness criteria. The experimental prompt also asked GPT-4 to provide an explanation for why the document was responsive by citing text from the document.

The experiment proceeded in two stages: (1) Sidley provided GPT-4 with the same review instructions given to the attorneys and collected data on GPT-4’s performance relative to the human review; and (2) based on initial output from stage one, the prompt for GPT-4 was modified to address ambiguities in the responsiveness criteria, mirroring a quality control (QC) feedback loop that provided the same additional information given to the contract attorneys outside the original review instructions.

Read the entire article on the experiment and results obtained here.

Authors

  • Colleen M. Kenney

    COLLEEN KENNEY, founder and head of Sidley’s eDiscovery and Data Analytics group, is a trial lawyer and one of the country’s preeminent authorities on e-discovery. Colleen has more than 30 years of experience representing clients as first chair in complex financial, securities, antitrust, mass tort, products liability, consumer, and employment class action litigation. Colleen is a Certified Public Accountant and a Certified Management Accountant. She is a longtime member of the Sedona Conference Working Group on Electronic Discovery, a member of the Electronic Discovery Institute, an active member of the Seventh Circuit Electronic Discovery Pilot Program, a faculty speaker for the Electronic Discovery Institute and a member of the Duke Law Conference Discovery Proportionality Guidelines Team. Prior to law school, she worked for four years as an auditor for a large public accounting firm. Colleen has been a faculty member of the Annual Judicial Training Symposium for Federal Judges on behalf of the Electronic Discovery Institute board of directors and the Federal Judicial Center, a program that includes views from plaintiffs and requesting parties, in-house counsel, defense counsel, government agency enforcement staff, non-lawyer technology specialists and judges. Described by clients as a “big-picture strategist” and “the most confident voice in the room,” Colleen was ranked in Chambers USA (2021–2023) and in Chambers Global (2022–2023) in the E-Discovery & Information Governance section. Colleen was also recognized in Who’s Who Legal: Litigation and as one of the Top 100 High Stakes Litigators.

  • Matt S. Jackson

    MATT JACKSON brings 20 years of experience to his practice focusing on complex electronic discovery matters and all aspects of the Electronic Discovery Reference Model. Matt regularly advises Fortune 100 clients regarding best practices on information management, preservation, discovery-readiness solutions, defensible deletion and developments at the critical intersection of law and technology. Matt emphasizes a holistic approach to data management for investigations and litigation, including ways to leverage institutional knowledge across matters and defensible but practical solutions to data issues, such as the application of AI and advanced analytics. Matt has worked on numerous large-scale matters, including DOJ second requests and regulator-driven investigations, class actions and commercial disputes in highly regulated industries, such as energy, financial services and life sciences. These matters often involve the production of hundreds of millions of pages of documents and the management of hundreds of terabytes of data. He is a member of Sidley’s E-Discovery Task Force and a published author and speaker on e-discovery issues. Matt previously served as counsel at Sidley from 2006 to 2014. Prior to rejoining the firm, he worked as a Managing Director at a legal operations and compliance consulting firm. Matt graduated from the DePaul University College of Law. He earned his B.S. in Finance and Business Administration from DePaul University.

  • Robert Keeling

    Robert Keeling is a partner at Sidley Austin LLP and head of the firm's e-discovery and data analytics group. He is also the Chair of EDRM's Global Advisory Council.
