Empirical Study of LLM Fine-Tuning for Text Classification in Legal Document Review


[Editor’s Note: This article was first published in December 2023 by IEEE and EDRM is grateful to Robert Keeling and Nathaniel Huber-Fliflet of our Trusted Partners, Sidley and Ankura for permission to republish. The opinions and positions are those of the authors.]

Abstract

The increased integration of Large Language Models (LLMs) across industry sectors is providing domain experts with new methods for optimizing text classification. These LLMs are pretrained on exceedingly large amounts of data; however, practitioners can perform additional training, or “fine-tuning,” to improve their text classifier’s results for their own use cases. This paper presents a series of experiments comparing a standard, pretrained DistilBERT model and a fine-tuned DistilBERT model, both leveraged for the downstream NLP task of text classification. Tuning the model with domain-specific data from real-world legal matters suggests that fine-tuning improves the performance of LLM text classifiers.


To evaluate the performance of text classification models built with these two Large Language Models, we employed two distinct approaches: 1) scoring a document’s entire text for prediction and 2) scoring snippets (sentence-level components of a document) for prediction. When comparing the two approaches, we found that which prediction method performs better depends on the use case.

Keywords—LLM, MLM, fine-tuning, text classification, large language model, predictive modeling, TAR, predictive coding

I. INTRODUCTION

With recent advancements in Large Language Models (LLMs), it has become imperative for downstream industries to identify applications of LLMs within their business domains. The legal industry is at the forefront of this pursuit, given its common practice of applying predictive modeling – known in the legal industry as ‘predictive coding’ or ‘Technology Assisted Review (TAR)’ – a popular tool used to augment manual document review and text classification.

The integration of machine learning in legal disputes initially involved traditional methods such as Logistic Regression (LR) and Support Vector Machine (SVM). Recent developments in machine learning and artificial intelligence have heightened the need to incorporate deep learning into the TAR toolkit. As LLMs continue evolving into state-of-the-art deep learning methods, it naturally becomes both viable and imperative to explore their applications in legal disputes.

Currently, there are two prominent architectures of LLMs: Masked Language Models (MLM) and Causal Language Models (CLM). BERT and its variants represent the former, while GPT and other generative models represent the latter. MLMs and CLMs are both built upon the transformer architecture, which is the foundation of Large Language Models. In MLMs, the model is trained to predict masked tokens within the input sequence (e.g., a sentence), whereas in CLMs, the model is trained to predict the next token in the sequence.
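To make the distinction concrete, the short sketch below (not from the paper) probes each objective with Hugging Face pipelines; the model names are illustrative defaults rather than the authors’ configuration.

    from transformers import pipeline

    # MLM objective: predict the token hidden behind [MASK] anywhere in the input.
    fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
    print(fill_mask("The parties signed the [MASK] on Friday.")[0]["token_str"])

    # CLM objective: predict the next token(s), continuing the sequence left to right.
    generate = pipeline("text-generation", model="gpt2")
    print(generate("The parties signed the", max_new_tokens=5)[0]["generated_text"])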

While both types of models handle Natural Language Processing (NLP) tasks, their functions and use cases can vary widely. MLMs are frequently utilized for tasks such as text classification, sentiment analysis, and named entity recognition. Alternatively, CLMs specialize in tasks like text generation and summarization.

LLMs are initially pretrained on extensive generic data; BERT, for example, was pretrained on Wikipedia and Google’s BooksCorpus. Training an LLM on such copious data makes the model extremely robust, but it also makes the model general-purpose and not attuned to any specific domain. Additional training of an LLM, or “fine-tuning,” is critical to align the model with a specific use case and improve results.

Prominent pretrained LLMs can be fine-tuned and applied to specific tasks, such as text classification.

Fine-tuning an LLM leverages a set of domain-specific text exemplars as additional training data for the existing, pretrained model. This tuning is a form of self-supervised learning: words are masked at random and used as labels to retrain a small set of parameters within the original model, so it normally does not require human-labeled training data. LLMs trained on human-labeled sets and applied in text classification scenarios are becoming more popular in the legal domain. This approach is generally believed to produce effective models, and it also suggests the potential benefit of fine-tuning the LLM before implementing it for text classification.
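As a rough illustration of this self-supervised tuning step, the sketch below masks tokens in domain text and retrains DistilBERT with the Hugging Face Trainer. The input file name, hyperparameters, and output directory are assumptions for illustration, not the paper’s actual configuration.

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

    # Assumed input: one cleansed document (or passage) per line in matter_text.txt.
    dataset = load_dataset("text", data_files={"train": "matter_text.txt"})["train"]
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                          batched=True, remove_columns=["text"])

    # Mask 15% of tokens at random; the masked tokens themselves serve as the labels,
    # so no human-labeled data is needed for this step.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="distilbert-finetuned",
                               num_train_epochs=3, per_device_train_batch_size=16),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()
    trainer.save_model("distilbert-finetuned")  # reused by the classification sketch below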

There is the potential for performance improvements in NLP tasks when a fine-tuned LLM is used for text classification. The fine-tuning process acclimates the underlying LLM to the unique characteristics and nuances of the textual data within the domain-specific data before it is used for classification. While applying LLMs that have not been fine-tuned to text classification tasks often yields acceptable performance, experimenting with fine-tuned LLMs provides a compelling opportunity to further improve performance in certain NLP applications.

In this paper, we conducted a series of experiments that examined the performance impact of fine-tuning an LLM (DistilBERT) in a text classification scenario. The experiments were conducted using three data sets from confidential, non-public, real-world legal matters across various industries. These data sets were comprised of unstructured data, including emails and other electronic document types such as Microsoft Office, PDFs, and text files. A subset of this data was used to fine-tune a pretrained DistilBERT model – the model was then applied to a text classification task for each matter’s data set. In our experiments, we compared a fine-tuned DistilBERT model to an “out of the box” pretrained DistilBERT model.
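A minimal sketch of that classification step is shown below, assuming a fine-tuned checkpoint saved by an earlier MLM tuning run such as the one sketched above; swapping in the stock distilbert-base-uncased checkpoint gives the “out of the box” baseline. The example text and two-class setup are hypothetical.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # "distilbert-finetuned" is the assumed output of the MLM tuning sketch;
    # replace it with "distilbert-base-uncased" for the out-of-the-box baseline.
    checkpoint = "distilbert-finetuned"
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # A hypothetical document; in practice the classification head is trained on the
    # human-labeled review set (e.g., with Trainer) before scoring documents.
    batch = tokenizer(["Please review the attached supply agreement."],
                      truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**batch).logits, dim=-1)
    print(probs[:, 1])  # probability of the positive (e.g., responsive) class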

Our experiments demonstrate that the fine-tuned DistilBERT model consistently outperforms the “out of the box” pretrained DistilBERT model when applied to text classification. This observation underscores the importance of incorporating domain-specific data into an LLM’s fine-tuning for its subsequent deployment on a text classification task.

To assess the performance of fine-tuning, we used two distinct approaches: first, we applied each model to classify a document’s entire text, and second, we applied each model to classify only snippets of text from the same document set. A snippet, in our experiments, is a component part of a document’s text, typically two or three sentences. We found that the fine-tuned DistilBERT model performed demonstrably better than the “out of the box” pretrained DistilBERT model. Additionally, when comparing the document-level and snippet-level results of the fine-tuned model, snippet classification outperformed document classification on one data set, while document classification yielded superior results on the other two data sets.


Finally, we compared the performance of the fine-tuned DistilBERT model with a traditional Logistic Regression model.

Our findings indicate that both approaches – fine-tuned LLMs and traditional Logistic Regression models – provide effective solutions for text classification in legal matters. The versatility of these distinct approaches suggests that new modeling strategies can be applied to improve the performance of text classification tasks in the legal domain.

Prior publications of text classification research in the legal domain are discussed in Section II. Section III details the experiment methodology and construction. The experimental results are presented in Section IV, and our findings and conclusions are summarized in Section V.

II. RELATED WORK

Machine learning techniques such as text classification are well established in the legal domain, with Logistic Regression and Support Vector Machine being two popular machine learning algorithms for this task [1]. These algorithms learn from bag-of-words features generated by tokenization. Before transformer models were applied to text classification, prior studies applied deep learning methods, such as Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), to text classification in legal document review [2, 3, 4]. CNN demonstrated good performance, yet across various data sets no single algorithm consistently outperformed the others.
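For readers unfamiliar with that traditional baseline, here is a minimal scikit-learn sketch of bag-of-words features feeding a Logistic Regression classifier; the toy texts and labels are hypothetical, not data from the study.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical stand-ins for a human-labeled review set (1 = responsive).
    train_texts = ["please review the supply agreement", "lunch menu for friday",
                   "draft contract terms attached", "company picnic photos"]
    train_labels = [1, 0, 1, 0]

    # Bag-of-words features from simple tokenization, fed to Logistic Regression.
    clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    print(clf.predict_proba(["revised agreement attached for review"])[:, 1])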

In recent years, LLMs have surpassed earlier deep learning models as the state of the art across NLP tasks. Created by Google in 2018 [5], BERT is an extremely prevalent LLM that allows for transfer learning in NLP tasks by fine-tuning “out of the box” pretrained models on domain-specific data for downstream applications. Zhao, Ye and Yang (2021) [6] studied the effectiveness of transfer learning using BERT in privileged document review, compared to Logistic Regression on the same text classification task. The experiments yielded mixed results on performance improvement.

In “An Empirical Comparison of DistilBERT, Longformer and Logistic Regression for Predictive Coding” [7], the authors tested the application of DistilBERT and Longformer to text classification. The results demonstrated that Longformer performs better than or similar to DistilBERT and Logistic Regression because Longformer can handle more tokens as input than the other algorithms. However, due to Longformer’s training and compute time, it is not practical to use on real-world document review projects. That study also briefly tested fine-tuning the DistilBERT model with domain-specific data. In Wei et al. (2022) [7], the LLM was fine-tuned with publicly available legal domain data and its performance was measured on an active learning text classification task.


Using text classification to classify snippets of text from documents is gaining popularity, especially in the legal domain [8, 9]. In this approach, documents are broken into snippets – small passages usually ranging from 50 to 200 words – and the model is applied to all snippets from the document. The highest-scoring snippet then represents the score for the whole document. Snippet classification supports Explainable AI and simplifies the explanation of why the model made its classification decision, further reducing the black-box nature of text classification.
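A minimal sketch of this aggregation rule is shown below, assuming word windows of roughly 100 words and any per-text scoring function (for example, one of the classifiers sketched earlier); the window size is an illustrative assumption.

    def split_into_snippets(text, words_per_snippet=100):
        """Break a document into consecutive word windows (snippet size assumed)."""
        words = text.split()
        return [" ".join(words[i:i + words_per_snippet])
                for i in range(0, len(words), words_per_snippet)] or [text]

    def score_document_by_snippets(document, score_fn):
        """score_fn maps a text string to a probability. The highest-scoring snippet
        supplies both the document's score and the supporting passage used to
        explain the classification decision."""
        snippets = split_into_snippets(document)
        scores = [score_fn(s) for s in snippets]
        best = max(range(len(scores)), key=scores.__getitem__)
        return scores[best], snippets[best]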

III. EXPERIMENTS

A. Data Sets

Our experiments were conducted on three data sets from confidential, non-public, real-world legal matters across various industries. These data sets were comprised of unstructured data, including emails and other electronic document types, such as Microsoft Office, PDFs, and text files. We reduced the data size for fine-tuning to improve the speed of the process and to avoid overfitting. The data for fine-tuning was limited by removing file types with large quantities of text and files that may contain unhelpful textual structure or patterns, such as Microsoft Excel files. The filtered fine-tuning data was further cleansed by removing email headers, email footers, URLs, and duplicative text. Table I provides the breakdown of fine-tuning data for each of the three data sets.

TABLE I. Data Sets for MLM Fine-Tuning

Matter       Total Number of Documents    Number of Documents Used for Fine-Tuning
Project A    4,000,000                    400,000
Project B    1,000,000                    300,000
Project C    800,000                      250,000
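As a rough illustration of the cleansing step described in this subsection, the sketch below strips email headers, footers, and URLs with simple regular expressions and then removes duplicative text; the patterns are crude heuristics for illustration, not the authors’ actual pipeline.

    import re

    HEADER_RE = re.compile(r"^(from|to|cc|bcc|sent|subject|date):.*$", re.IGNORECASE | re.MULTILINE)
    FOOTER_RE = re.compile(r"(?is)this e-?mail .*?(confidential|privileged).*$")  # crude footer heuristic
    URL_RE = re.compile(r"https?://\S+|www\.\S+")

    def cleanse(text):
        """Strip email header lines, confidentiality footers, and URLs."""
        text = HEADER_RE.sub(" ", text)
        text = FOOTER_RE.sub(" ", text)
        text = URL_RE.sub(" ", text)
        return re.sub(r"\s+", " ", text).strip()

    def deduplicate(texts):
        """Drop exact duplicates after cleansing (duplicative text removal)."""
        seen, unique = set(), []
        for t in map(cleanse, texts):
            if t and t not in seen:
                seen.add(t)
                unique.append(t)
        return unique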

Table II provides the breakdown of the labels for the three data sets for the text classification experiments.

Read the entire article, with a detailed discussion of the methodology, a comprehensive list of prompts, and quantified results, by clicking the PDF below.

ieee-2023-empirical-study-of-llm-finetuning-for-text-classification-in-legal-document-review

Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.

Authors

  • Fusheng Wei

    Dr. Fusheng Wei is a Senior Director at Ankura in Washington, DC. He has more than 20 years of experience in financial application system development, business and system analysis, and data analytics. Fusheng has extensive experience building application systems for the secondary mortgage market in such areas as loan acquisition, servicing, securitization, and credit risk transfer. Fusheng’s analytics expertise encompasses artificial intelligence, machine learning, deep learning, natural language processing, and visual analytics. He has developed software application systems throughout the life cycle, including project and application portfolio management. He earned his BS at Wuhan University and his PhD at the University of Massachusetts Amherst.

  • Robert Keeling

    Robert Keeling is a partner at Sidley Austin LLP and head of the firm's e-discovery and data analytics group. He is also the Chair of EDRM's Global Advisory Council.

  • Nathaniel Huber-Fliflet

    Nathaniel (Nate) Huber-Fliflet is a Senior Managing Director at Ankura, based in Washington, DC. He has 15 years of experience consulting with law firms and corporations on advanced data analytics solutions and legal technology services. His core expertise is in machine learning, speech recognition, data mining, and software development – all within the legal technology and services market space.

  • Jianping Zhang, PhD

    Dr. Jianping Zhang is a Senior Managing Director at Ankura in the Washington, DC office. He has over 25 years of proven leadership and technical experience in solving complex data analytics problems. Specifically, his experience covers artificial intelligence, machine learning, predictive analytics, big data analytics, text analytics, and software development. Jianping’s extensive experience in analytics addresses clients’ complex and evolving data challenges. Jianping has led efforts to develop AI, machine learning, and advanced analytics products and solutions for government and commercial problems in various industries, including financial, legal, and high tech. Jianping’s innovative research and development work has resulted in more than 100 peer-reviewed technical publications and two US patents. He earned his BS in Computer Science at Wuhan University and his PhD in Computer Science, Machine Learning from the University of Illinois at Urbana-Champaign.

  • Adam Dabrowski

    Adam Dabrowski is a Director, Data Science at Ankura. He is a data analyst knowledgeable about statistical analysis and proficient in programming languages such as R, Python, SQL, and SAS, with database and software knowledge in SQL, Tableau, and Power BI. With three years of experience in business and analytics across multiple sectors, Adam is a results-oriented data analyst adept at managing and analyzing large volumes of information. He earned his Bachelor’s degree in Integrative Neuroscience at Binghamton University and his Master’s in Bioinformatics at Georgia Institute of Technology.

  • Jinchao Yang, PhD

    Dr. Jingchao Yang is the Director, Data and Software Engineering at Ankura. He earned his BS in Science-Applied Computer Science at Eastern Michigan University and his PhD in Geoinformation Sciences at George Mason University.

  • Qiang Mao

    Dr. Qiang Mao is Director, Data Science at Ankura. He earned his MS in Civil Engineering at Southeast University and his PhD in Structural Engineering at Drexel University. He is a licensed Professional Engineer.

  • Han Qin, PhD

    Han Qin is a Senior Director at Ankura based in Washington, DC. He has extensive experience in artificial intelligence and machine learning and is knowledgeable in advanced computer programming and data science. He has integrated advanced methodologies into applications to support the analysis of both structured and unstructured data in the financial, construction, and legal industries. He earned his BEng in Computer Software Engineering at Wuhan University, his MS in Geospatial Engineering at Eastern Michigan University and his PhD in Geospatial Data Science at George Mason University.
