[Editor’s Note: This article was first published in December 2023 by IEEE and EDRM is grateful to Robert Keeling and Nathaniel Huber-Fliflet of our Trusted Partners, Sidley and Ankura for permission to republish. The opinions and positions are those of the authors.]
Abstract
The increased integration of Large Language Models (LLMs) across industry sectors is giving domain experts new methods for optimizing text classification. These LLMs are pretrained on exceedingly large amounts of data; however, practitioners can perform additional training, or “fine-tuning,” to improve their text classifier’s results for their own use cases. This paper presents a series of experiments comparing a standard, pretrained DistilBERT model and a fine-tuned DistilBERT model, both leveraged for the downstream NLP task of text classification. Tuning the model with domain-specific data from real-world legal matters suggests that fine-tuning improves the performance of LLM text classifiers.
To evaluate the performance of text classification models built with these two Large Language Models, we employed two distinct approaches: 1) scoring a whole document’s text for prediction and 2) scoring snippets (sentence-level components of a document) for prediction. When comparing the two approaches, we found that which prediction method performs better depends on the use case.
Keywords—LLM, MLM, fine-tuning, text classification, large language model, predictive modeling, TAR, predictive coding
I. INTRODUCTION
With recent advancements in Large Language Models (LLMs), it has become imperative for downstream industries to identify applications of LLMs within their business domains. The legal industry is at the forefront of this pursuit, given its common practice of applying predictive modeling – known in the legal industry as ‘predictive coding’ or ‘Technology Assisted Review (TAR)’ – a popular tool used to augment manual document review and the text classification process.
Initially, the integration of machine learning in legal disputes involved traditional methods like Logistic Regression (LR) and Support Vector Machine (SVM). Recent developments in machine learning and artificial intelligence have expedited the need to incorporate deep learning into the TAR toolkit. As LLMs continue evolving into state-of-the-art deep learning methods, it becomes both viable and imperative to explore their applications in legal disputes.
Currently, there are two prominent architectures of LLMs: Masked Language Models (MLM) and Causal Language Models (CLM). BERT and its variants represent the former, while GPT and other generative models represent the latter. MLMs and CLMs are both built upon the transformer architecture, which forms the foundation of Large Language Models. An MLM is trained to predict masked tokens within the input sequence (e.g., a sentence), whereas a CLM is trained to predict the next token in the input sequence.
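The difference between the two objectives can be illustrated with the Hugging Face transformers library. In the sketch below, the checkpoints (distilbert-base-uncased and gpt2) are generic public models chosen for illustration only, not the models evaluated in our experiments.

```python
# Illustrative contrast between the MLM and CLM objectives using generic
# public checkpoints; this is a sketch, not the experimental setup.
from transformers import pipeline

# MLM: predict the masked token inside the input sequence.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
print(fill_mask("The parties signed the [MASK] agreement."))

# CLM: predict the next tokens that continue the input sequence.
generate = pipeline("text-generation", model="gpt2")
print(generate("The parties signed the", max_new_tokens=10))
```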
While both types of models handle Natural Language Processing (NLP) tasks, their functions and use cases can vary widely. MLMs are frequently utilized for tasks such as text classification, sentiment analysis, and named entity recognition. Alternatively, CLMs specialize in tasks like text generation and summarization.
LLMs are initially pre-trained on extensive generic data; BERT, for example, was pre-trained on English Wikipedia and the BooksCorpus data set. Training an LLM with such copious data makes the model extremely robust, but it also makes the model general-purpose rather than attuned to any specific domain. Additional training of an LLM, or “fine-tuning”, is critical to align the model with a specific use case and improve results.
Prominent pretrained LLMs can be fine-tuned and applied to specific tasks, such as text classification.
Fine-tuning an LLM leverages a set of domain-specific text exemplars as additional training data for the existing, pretrained model. Tuning is a form of self-supervised learning in which words are masked at random and used as labels to retrain a small set of parameters within the original model; this method normally does not require human-labeled training data. Fine-tuned LLMs, trained with human-labeled sets and applied in text classification scenarios, are becoming more popular in the legal domain. This approach is generally believed to produce effective models, which in turn points to the potential benefits of fine-tuning the LLM before implementing it for text classification.
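As a rough illustration of this self-supervised step, the sketch below continues pre-training DistilBERT with a masked language modeling objective on a hypothetical domain_dataset of document text, using the Hugging Face transformers library. The dataset name, masking rate, and hyperparameters are illustrative assumptions, not the configuration used in our experiments.

```python
# Minimal sketch of self-supervised MLM fine-tuning on domain text.
# `domain_dataset` is an assumed Hugging Face Dataset with a "text" column.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = domain_dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the masked tokens act as labels,
# so no human-labeled training data is required for this step.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-domain-mlm",
                           num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("distilbert-domain-mlm")
tokenizer.save_pretrained("distilbert-domain-mlm")
```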
There is the potential for performance improvements in NLP tasks when a fine-tuned LLM is used for text classification. The fine-tuning process acclimates the underlying LLM to the unique characteristics and nuances of the textual data within the domain-specific data before it is used for classification. While applying LLMs that have not been fine-tuned to text classification tasks often yields acceptable performance, experimenting with fine-tuned LLMs provides a compelling opportunity to further improve performance in certain NLP applications.
In this paper, we conducted a series of experiments that examined the performance impact of fine-tuning an LLM (DistilBERT) in a text classification scenario. The experiments were conducted using three data sets from confidential, non-public, real-world legal matters across various industries. These data sets comprised unstructured data, including emails and other electronic document types such as Microsoft Office, PDFs, and text files. A subset of this data was used to fine-tune a pretrained DistilBERT model – the model was then applied to a text classification task for each matter’s data set. In our experiments, we compared a fine-tuned DistilBERT model to an “out of the box” pretrained DistilBERT model.
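The downstream classification step can be sketched as follows: load either the domain-adapted checkpoint from the MLM sketch above or the “out of the box” distilbert-base-uncased baseline, attach a sequence classification head, and train it on a hypothetical labeled_train set with “text” and “label” columns. Names and hyperparameters are again illustrative assumptions rather than our experimental configuration.

```python
# Minimal sketch of supervised text classification with DistilBERT.
# `labeled_train` is an assumed Hugging Face Dataset with "text" and "label" columns.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Swap in "distilbert-base-uncased" to build the "out of the box" baseline classifier.
checkpoint = "distilbert-domain-mlm"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-classifier",
                           num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=labeled_train.map(tokenize, batched=True),
)
trainer.train()
```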
Our experiments demonstrate that the fine-tuned DistilBERT model consistently outperforms the “out of the box” pretrained DistilBERT model when applied to text classification. This observation underscores the importance of incorporating domain-specific data into an LLM’s fine-tuning for its subsequent deployment on a text classification task.
To assess the performance of fine-tuning, we used two distinct approaches: first, we applied each model to classify a document’s entire text, and second, we applied each model to classify only snippets of text from the same document set. A snippet, in our experiments, is a component part of a document’s text, typically two or three sentences. We found that the fine-tuned DistilBERT model performed demonstrably better than the “out of the box” pretrained DistilBERT model. Additionally, when comparing the document-level and snippet-level results of the fine-tuned model, snippet classification outperformed document classification on one data set, while document classification yielded superior results on the other two data sets.
Finally, we compared the performance of the fine-tuned DistilBERT model with a traditional Logistic Regression model.
Our findings indicate that both approaches – fine-tuned LLMs and traditional Logistic Regression models – provide effective solutions for text classification in legal matters. The versatility of these distinct approaches suggests that new modeling strategies can be applied to improve the performance of text classification tasks in the legal domain.
Prior publications of text classification research in the legal domain are discussed in Section II. Section III details the experiment methodology and construction. The experimental results are presented in Section IV, and our findings and conclusions are summarized in Section V.
II. RELATED WORK
Machine learning techniques such as text classification are well established in the legal domain, with Logistic Regression and Support Vector Machine being two popular machine learning algorithms for this task [1]. These algorithms learn from bag-of-words features generated through tokenization. Before applying transformer models to text classification, prior studies applied deep learning methods, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, for text classification in legal document review [2, 3, 4]. CNN demonstrated good performance, yet across various data sets no single algorithm consistently outperformed the others.
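For reference, this traditional baseline can be sketched with scikit-learn as below; the placeholder variables and feature settings are illustrative assumptions rather than the configurations used in the cited studies.

```python
# Minimal sketch of a bag-of-words Logistic Regression text classifier.
# `train_texts`, `train_labels`, and `test_texts` are assumed placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ("bow", CountVectorizer(max_features=50_000)),   # bag-of-words term counts from tokenization
    ("lr", LogisticRegression(max_iter=1000)),
])
classifier.fit(train_texts, train_labels)
relevance_scores = classifier.predict_proba(test_texts)[:, 1]  # scores used to rank documents for review
```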
In recent years, LLMs have surpassed earlier deep learning models as the state-of-the-art architecture across NLP tasks. Created by Google in 2018 [5], BERT is an extremely prevalent LLM that enables transfer learning in NLP tasks by fine-tuning “out of the box” pretrained models on domain-specific data for downstream applications. Zhao, Ye and Yang (2021) [6] studied the effectiveness of transfer learning using BERT in privileged document review, compared to Logistic Regression on the same text classification task. The experiments yielded mixed performance improvement results.
In “An Empirical Comparison of DistilBERT, Longformer and Logistic Regression for Predictive Coding” [7], the authors tested the application of DistilBERT and Longformer to text classification. The results demonstrated that Longformer performs better than or similar to DistilBERT and Logistic Regression because Longformer can handle more tokens as input than the other algorithms. However, due to Longformer’s training and compute time, it is not practical to use with real-world document review projects. That study also briefly tested fine-tuning the DistilBERT model with domain-specific data. In Wei et al. (2022) [7], the LLM was fine-tuned with publicly available legal domain data and used to measure its performance in an active learning text classification task.
Applying text classification to snippets of text from documents is gaining popularity, especially in the legal domain [8, 9]. In this approach, documents are broken into snippets – small passages usually ranging from 50 to 200 words – and the model is applied to all snippets from the document. The highest-scoring snippet then represents the score for the whole document. Snippet classification supports Explainable AI and simplifies the explanation of why the model made its classification decision, further reducing the black-box nature of text classification.
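As a rough illustration, the sketch below splits a document into overlapping word windows, scores each snippet with an arbitrary classifier function, and keeps the highest-scoring snippet as the document-level score; the window sizes and the score_snippet interface are illustrative assumptions.

```python
# Minimal sketch of snippet-level scoring with max aggregation.
from typing import Callable, List, Tuple

def make_snippets(text: str, size: int = 150, stride: int = 75) -> List[str]:
    """Split a document into overlapping windows of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - size + 1, 1), stride)]

def score_document(text: str, score_snippet: Callable[[str], float]) -> Tuple[float, str]:
    """Score every snippet and let the best snippet represent the document."""
    scored = [(score_snippet(s), s) for s in make_snippets(text)]
    best_score, best_snippet = max(scored)
    # Returning the top snippet alongside the score supports explainability:
    # it shows which passage drove the classification decision.
    return best_score, best_snippet
```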
III. EXPERIMENTS
A. Data Sets
Our experiments were conducted on three data sets from confidential, non-public, real-world legal matters across various industries. These data sets comprised unstructured data, including emails and other electronic document types, such as Microsoft Office, PDFs, and text files. We reduced the data size for fine-tuning to improve the speed of the process and to avoid overfitting. The data for fine-tuning was limited by removing file types with large quantities of text and files that may contain unhelpful textual structure or patterns, such as Microsoft Excel files. The filtered fine-tuning data was further cleansed by removing email headers, email footers, URLs, and duplicative text. Table I provides the breakdown of fine-tuning data for each of the three data sets.
TABLE I. Data sets for MLM Fine-Tuning
| Matter | Total Number of Documents | Number of Documents Used for Fine-Tuning |
| --- | --- | --- |
| Project A | 4,000,000 | 400,000 |
| Project B | 1,000,000 | 300,000 |
| Project C | 800,000 | 250,000 |
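The sketch below illustrates the kind of cleansing described above: stripping URLs and common email header lines, then dropping near-verbatim duplicates. The regular expressions and deduplication key are illustrative assumptions, not the exact rules applied to these data sets.

```python
# Minimal sketch of fine-tuning data cleansing and deduplication.
import re

URL_RE = re.compile(r"https?://\S+")
HEADER_RE = re.compile(r"^(From|To|Cc|Bcc|Sent|Subject|Date):.*$",
                       re.IGNORECASE | re.MULTILINE)

def cleanse(text: str) -> str:
    """Remove URLs and email header lines, then normalize whitespace."""
    text = URL_RE.sub(" ", text)
    text = HEADER_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(texts):
    """Keep only the first occurrence of each cleansed passage."""
    seen, unique = set(), []
    for t in texts:
        key = cleanse(t).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```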
Table II provides the breakdown of the labels for the three data sets for the text classification experiments.
Read the entire article, with a detailed discussion of the methodology, a comprehensive list of prompts, and quantified results, by clicking the PDF below. Page through by clicking the down arrow at the bottom of the window.
ieee-2023-empirical-study-of-llm-finetuning-for-text-classification-in-legal-document-review

Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.