Craig Ball: ChatGPT Proves a Mediocre Law Student

[Editor’s Note: Craig Ball puts ChatGPT through its paces in the context of his law school class. EDRM is grateful to Craig for permission to republish. First published on Ball in Your Court, 1/27/2023]


I recently spent a morning testing ChatGPT’s abilities by giving it exercises and quizzes designed for my law and computer science graduate students. Overall, I was impressed with its performance, but also noticed that it’s frequently wrong but never in doubt: a mechanical mansplainer!

If you’re asking, “What is ChatGPT,” I’ll let it explain itself: “ChatGPT is a large language model developed by OpenAI. It is a type of machine learning model called a transformer, which is trained to generate text based on a given prompt. It is particularly well-suited to tasks such as natural language processing, text generation, and language translation. It is capable of understanding human language and generating human-like text, which makes it useful for a wide range of applications, such as chatbots, question-answering systems, and text completion.” I’d add that it’s the first artificial intelligence tool that anyone can play with, and it’s generating enormous excitement.
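For readers who want to play with it beyond the web interface, here is a minimal sketch of how one of the quiz questions below might be posed to an OpenAI model programmatically. It is an illustration only, assuming the openai Python package (pre-1.0 interface); the model name, API key, and prompt wording are placeholders, and the testing described in this post was done through the ordinary ChatGPT chat window, not the API.

    # Minimal sketch: posing a quiz question to an OpenAI chat model via the
    # openai Python package (pre-1.0 interface). Model name, API key, and
    # prompt wording are illustrative placeholders, not part of this post's testing.
    import openai

    openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": "You are a student answering a multiple-choice quiz on electronic discovery."},
            {"role": "user",
             "content": ("Which of the following is NOT an essential metadata value to preserve "
                         "and produce for a file? 1. Custodian 2. File Name "
                         "3. Last Modified Date and Time 4. Last Accessed Date and Time")},
        ],
    )

    # Print the model's answer to the quiz question
    print(response["choices"][0]["message"]["content"])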


In terms of its legal knowledge, ChatGPT performed like a mediocre law student: it did well on essay questions and drafting exercises but struggled with multiple choice questions. I couldn’t assign it a definitive score on the 25-question quiz, as some questions required image analysis and I didn’t see a way to include illustrations in the chat interface. However, it performed poorly enough on the quiz for me to conclude that a passing score of 70 out of 100 points was out of reach.

One example of a question it struggled with was:

Which of the following is NOT an essential metadata value (a “metadata must”) to preserve and produce for a file?

  1. Custodian
  2. File Name
  3. Last Modified Date and Time
  4. Last Accessed Date and Time

The correct answer is Last Accessed Date and Time.  Class readings underscore the unreliability of Last Accessed dates and times, and I supply a table of “Metadata Musts” for preservation and production.  Presumably trained on other sources, ChatGPT did not seem to understand the importance of a proper chain of custody for electronic evidence and replied:

Custodian is not considered an essential metadata value to preserve and produce for a file. The other options (File Name, Last Modified Date and Time, Last Accessed Date and Time) are all considered essential metadata values that should be preserved and produced for a file in order to provide important information about the file and its history.
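The unreliability of Last Accessed values is easy to demonstrate. The short sketch below is an illustration, not part of the class materials, and uses a hypothetical file name: merely opening a file can change its access time, and whether the value updates at all depends on the operating system and volume settings (for example, NTFS volumes with last-access updates disabled, or Linux mounts using relatime).

    # Illustration (hypothetical file name): reading a file can alter its
    # Last Accessed timestamp, and whether it updates at all depends on the
    # operating system and volume settings.
    import os
    import time

    path = "example_evidence.docx"  # hypothetical file

    before = os.stat(path).st_atime
    time.sleep(1)                    # make any update visible
    with open(path, "rb") as f:
        f.read()                     # an ordinary read, as a copy or virus scan might do
    after = os.stat(path).st_atime

    print("Last accessed before:", time.ctime(before))
    print("Last accessed after: ", time.ctime(after))

If the two values differ, routine handling has already changed the metadata; if they do not, the timestamp was never being maintained faithfully in the first place. Either way, it is not a value worth treating as a “metadata must.”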

Since it wasn’t performing well on my quiz, I asked,

Can you suggest some multiple choice quiz questions for students studying the fundamentals of electronic discovery and electronic data storage?

The first suggestion was:

What is the process of collecting, processing, and producing electronically stored information (ESI) in response to a legal request called?

a. Electronic discovery

b. Digital forensics

c. E-discovery

d. Data retrieval

In ChatGPT’s formulation, “Electronic discovery” is the right response and “E-discovery” is wrong!

Another suggestion:

What is the most common format for ESI production?

a. PDF

b. TIFF

c. Native file format

d. ASCII

That’s a challenging question because, as much as I’ve focused on and written about forms of production, I’ve not come across metrics establishing the most common format.  In supplying an answer key, ChatGPT identified Native file format as the correct answer, though I would have guessed “TIFF.”  I certainly wouldn’t have included ASCII as a choice.

Here are a few other quiz questions suggested by ChatGPT.  I’ve bolded what ChatGPT defined as the correct response and, in each instance, I’d conclude it’s either not the best answer or the question is so poorly constructed as to defy characterizing any single answer as the right one:

What is the process of identifying and eliminating irrelevant or privileged ESI called?

a. Culling

b. Filtering

c. Redaction

d. Processing

What is the process of reviewing ESI to identify relevant information called?

a. Review

b. Analysis

c. Coding

d. Searching

What is the process of converting ESI from its original format to a reviewable format called?

a. Conversion

b. Processing

c. Extraction

d. Production

What is the process of creating a backup copy of ESI for preservation purposes called?

a. Backup

b. Imaging

c. Archiving

d. Duplication

What is the process of destroying or disposing of ESI after it is no longer needed called?

a. Deletion

b. Erasure

c. Destruction

d. Purging

Encouraged that the law faculty might be safe for another semester, I turned ChatGPT to a drafting exercise:

You are to develop the best checklist of potential data sources that are candidates for legal preservation in any engagement—good enough that you’ll want to keep it to use in your own work. It does NOT have to be entirely original, and I encourage you to start with the best existing checklist(s) you can find and then make updates and improvements. Certainly, you should seek to add missing sources, but you may also see a better way to organize and present the content. Remember that this checklist is a list of sources and varieties not methods and, crucially, it should serve to enhance the completeness of the personal data map you will create (though it should NOT be limited to the scope of your personal digital footprint alone). Thus, generic checklist items like “Online sources” or “Local storage” won’t be sufficiently granular. Be sure to credit your sources and format it as a checklist.

ChatGPT responded:

Data Preservation Checklist

  1. Social media accounts (Facebook, Twitter, Instagram, LinkedIn, etc.)
  2. Email accounts (Gmail, Outlook, Yahoo, etc.)
  3. Cloud storage accounts (Google Drive, Dropbox, OneDrive, etc.)
  4. Mobile device data (text messages, call logs, contacts, etc.)
  5. Computer files (documents, photos, videos, etc.)
  6. Internet browsing history and cache
  7. Online marketplace accounts (eBay, Amazon, Etsy, etc.)
  8. Online banking and financial accounts (Paypal, bank accounts, credit cards, etc.)
  9. GPS location data
  10. Smart home device data (Amazon Echo, Google Home, etc.)
  11. Wearable device data (Fitbit, Apple Watch, etc.)
  12. Gaming accounts (Xbox Live, PlayStation Network, Steam, etc.)
  13. Voicemail and answering machine messages
  14. Instant messaging and chat logs (WhatsApp, Messenger, etc.)
  15. Social media direct messages and private messages
  16. Online dating profiles and messages
  17. Online forum and message board posts
  18. Online video and audio recordings (YouTube, Vimeo, etc.)
  19. Online photo and image sharing accounts (Flickr, Imgur, etc.)
  20. Online document collaboration and sharing accounts (Google Docs, Microsoft Office 365, etc.)

Source: Adapted from “Digital Preservation Checklist” by the National Archives and Records Administration (NARA) (https://www.archives.gov/digital-preservation/checklist) and “Digital Preservation Checklist for Individuals” by the Digital Preservation Coalition (https://www.dpconline.org/advice/digital-preservation-checklist-for-individuals)

Not bad, and certainly typical of what I get from students.  Interestingly, both citations are dead links, so unacceptable from my students, and (arguably) ChatGPT simply ignored the requirement to format its answer as a checklist. 

I had ChatGPT answer the question a second time sans the mandate that it credit sources. I was struck by the differences in the lists supplied and the disclaimer now appended:

  1. Social media accounts (e.g. Facebook, Twitter, Instagram, LinkedIn)
  2. Email accounts (e.g. Gmail, Yahoo, Outlook)
  3. Cloud storage accounts (e.g. Google Drive, Dropbox, iCloud)
  4. Online shopping accounts (e.g. Amazon, eBay, Etsy)
  5. Financial accounts (e.g. bank accounts, credit card accounts, PayPal)
  6. Mobile phone data (e.g. call logs, text messages, contacts)
  7. Computer hard drives
  8. USB drives and external hard drives
  9. Social media messaging apps (e.g. WhatsApp, iMessage, Messenger)
  10. Online forums and discussion boards
  11. Online dating profiles
  12. Online gaming accounts
  13. Web browsing history and cookies
  14. Internet service provider records
  15. GPS data (e.g. location data from mobile devices)
  16. Smart home device data (e.g. Amazon Echo, Google Home)
  17. Vehicle tracking data (e.g. OnStar, Tesla)
  18. Medical records (e.g. electronic health records)
  19. Educational records (e.g. transcripts, grades)
  20. Employee records (e.g. HR files, payroll records)

Note: This checklist is not exhaustive and may vary depending on the specific engagement or case. It is important to consult with legal and technical experts to determine the most relevant and appropriate data sources for preservation. Additionally, it is important to consider any privacy laws and regulations when collecting and preserving personal data.

I asked ChatGPT to identify “the most important legal opinions I should assign my students to read that will help them better understand electronic discovery in the United States” and it responded:

  1. Zubulake v. UBS Warburg LLC, 217 F.R.D. 309 (S.D.N.Y. 2003) – This case established the “Zubulake standards,” which are widely used as a guide for electronic discovery in the United States. It lays out the obligations of parties to preserve and produce electronically stored information (ESI) and the sanctions that can be imposed for failure to comply.
  2. Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008) – This case discusses the duty to preserve ESI and the consequences of spoliation (destruction or alteration of evidence). It also addresses the standard for imposing sanctions for spoliation.
  3. The Sedona Conference® Commentary on Achieving Quality in the E-Discovery Process, The Sedona Conference (2008) – This commentary provides guidance on best practices for e-discovery, including the preservation and production of ESI, the use of technology-assisted review (TAR) and predictive coding, and the handling of privileged information.
  4. Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182 (S.D.N.Y. 2012) – This case addresses the use of TAR and predictive coding in e-discovery, and emphasizes the importance of cooperation and communication between parties when using these technologies.
  5. Morgan Stanley Smith Barney LLC v. Skowron, No. 11-CV-8335 (S.D.N.Y. Mar. 29, 2013) – This case discusses the use of proportionality in e-discovery, and the court’s authority to limit the scope of e-discovery in order to reduce the burden and expense of producing ESI.
  6. In re: Actos (Pioglitazone) Products Liability Litigation, MDL No. 6:11-md-2299 (W.D. La. 2013) – This case addresses the use of advanced analytics and data visualization tools in e-discovery, and highlights the importance of considering the costs and benefits of using these technologies in the e-discovery process.

Except for mischaracterizing a Sedona Conference Commentary as a “legal opinion,” they’re all fine suggestions, yet all are ten to twenty years old.

I asked ChatGPT a question posed by a student in my last class: “What is the difference between electronic discovery and computer forensics?”  It opined:

Electronic discovery (e-discovery) and computer forensics are both related to the collection and analysis of electronic data, but they serve different purposes.

Electronic discovery is the process of collecting, processing, and producing electronic data in response to a legal request or investigation. It is typically used in civil litigation and is designed to identify and produce relevant electronic evidence that may be used in court. The goal of e-discovery is to find information that is responsive to the legal request and that is not privileged or protected by any other legal exceptions.

Computer forensics, on the other hand, is the process of collecting, preserving, analyzing, and presenting electronic evidence in a manner that is admissible in court. It is typically used in criminal investigations and is designed to identify, recover, and preserve evidence of a crime or other misconduct. The goal of computer forensics is to find and preserve evidence that may be used to prove or disprove a crime or other wrongdoing.

In summary, electronic discovery is focused on finding and producing relevant electronic evidence in response to a legal request, while computer forensics is focused on finding and preserving evidence of a crime or other misconduct. Both processes require the use of specialized tools and techniques to collect and analyze electronic data, but they serve different purposes and are used in different types of legal proceedings.

A good answer.  My response in class focused on the relative accessibility and intelligibility of the ESI we deal with in e-discovery versus digital forensics, and I didn’t tie forensics to criminal investigations because so much of my work in the field has concentrated on civil cases.  But I can’t say I did any better than the AI.  I smiled more.

Finally, I asked, “Who are the most important thought leaders in the field of electronic discovery?”  ChatGPT replied:

  1. Ralph Losey: A well-known attorney and author who has written extensively on the topic of electronic discovery and has been recognized as a leading expert in the field.
  2. Craig Ball: A forensic expert and attorney who has played a key role in shaping the laws and practices related to electronic discovery through his work as a consultant, expert witness, and author.
  3. Tom O’Connor: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  4. George Socha: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.
  5. Maura Grossman: A consultant, speaker, and author who is considered one of the leading experts on the use of artificial intelligence and machine learning in electronic discovery and has been instrumental in advancing the field.
  6. E-Discovery Team: A team of experts who have written extensively on the topic of electronic discovery and have been recognized as thought leaders in the field through their work as consultants, speakers, and authors.
  7. Paul J. Lippe: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.
  8. John Simek: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  9. David Horrigan: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  10. Daniel Garrie: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.

Kudos to Ralph Losey for topping the list and garnering a second mention as “E-Discovery Team.”  I wouldn’t have considered Paul J. Lippe a candidate for an e-discovery-specific list, and I can’t very well commend ChatGPT for wokeness.  One woman?  No persons of color?  No jurists?  I’m flattered to be included among the other old, white men, but know well that there are plenty as or more deserving.

Author

  • Craig Ball

    Craig Ball is a Texas trial lawyer, computer forensic examiner, law professor and noted authority on electronic evidence. He limits his practice to serving as a court-appointed special master and consultant in computer forensics and electronic discovery and has served as the Special Master or testifying expert in computer forensics and electronic discovery in some of the most challenging and celebrated cases in the U.S. Craig is also EDRM’s General Counsel and a key contributor to many EDRM projects.