How Much Do Large Language Models Like GPT Cost? 

Using the right LLM for the job will save you money!

Big machines on a factory floor
Image: John Tredennick, Merlin Search Technologies and Midjourney

[EDRM Editor’s Note: The opinions and positions expressed are those of John Tredennick, Dr. William Webber and Lydia Zhigmitova.]

We recently wrote an article about Using Multi-LLM Systems for Investigations and Ediscovery: Fast, Flexible and Cost-Effective. One of the reasons we chose this path is that different large language models, e.g. GPT and Claude, are better suited to different purposes. As we pointed out, using different LLMs for different jobs “affords the flexibility to use the best, most cost-effective LLM for each task required.”

In this article we focus on cost effectiveness. Why? Because pricing can make a difference when two tools can do the same job equally well. If one of the LLMs costs fifty times more than the other, you might want to pay attention. As we will show you, that is sometimes the case.

How Do Software Companies Charge for Large Language Models?

Many of us started using ChatGPT when it was free. Some of us moved to the $20-a-month version to secure faster responses and access to later models, but daily volumes were still limited. These licenses aren’t suitable for commercial purposes and are not our focus here.

For legal systems, we need to move to a commercial license and access GPT, the underlying engine for ChatGPT, through an API (application programming interface).

Pricing under these types of licenses is based on two considerations:

  1. The model being used, e.g. GPT 3.5 versus GPT 4.
  2. The volume of tokens sent to the LLM (the prompt) and the volume received back from it (the completion).

Before we get to specific prices for different models, let’s talk a bit about tokens. They are central to the analysis of which models are best suited for different purposes.

About Tokens

Tokens are the basic units of text or code that an LLM uses to process and generate language. Tokens can be characters, words, subwords or other segments of text or code. A token count of 1,000 typically equates to about 750 English words.

How many tokens are in a prompt? That depends on how much information you send to the LLM. For our purposes, prompts tend to be long because we typically send the text of ediscovery documents (or summaries of that text) to the LLM along with our request or question.

Why send the text? Because, for ediscovery at least, we typically ask the LLM to base its answer on the text of one or more discovery documents (email, documents, transcripts, etc.). As a result, the volume of text transmitted to the LLM can be much larger than the response.  

For pricing purposes, we have found that the average volume of prompts is roughly ten times larger than the average volume of responses. Put another way, we typically send 1,000 tokens to the system for every 100 tokens received in response.
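These two rules of thumb (roughly 1,000 tokens per 750 English words, and responses about one-tenth the size of prompts) make token volumes easy to estimate. Here is a minimal Python sketch; the helper names are ours, and the figures are our observed averages, not exact vendor numbers:

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token estimate: ~1,000 tokens per 750 English words."""
    return round(word_count * 1000 / 750)

def estimate_completion_tokens(prompt_tokens: int, ratio: float = 10.0) -> int:
    """Assume responses run about one-tenth the size of prompts."""
    return round(prompt_tokens / ratio)

# A 1,500-word document works out to about 2,000 prompt tokens
prompt_tokens = estimate_tokens(1500)                     # 2000
completion_tokens = estimate_completion_tokens(prompt_tokens)  # 200
```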

Why does this matter? Because most vendors charge one price for the volume of tokens submitted in the prompt (input) and another, larger price for the volume of tokens returned by the LLM in its answer (output).

Commercial Pricing

Here is a chart showing current pricing for two different LLMs: GPT (from OpenAI and Microsoft) and Claude 2 (from Anthropic).[1] Both GPT and Claude come in different models, with pricing matching their capabilities.[2] Microsoft and OpenAI offer the same prices, so they are grouped together.

Comparative Pricing for GPT and Claude 2
LLM Model | Prompt (per 1,000 tokens) | Completion (per 1,000 tokens) | Cost per Document Processed
GPT-3.5 Turbo (4K) | $0.0015 | $0.0020 | $0.0034
Claude Instant (100K) | $0.0016 | $0.0055 | $0.0044
GPT-3.5 Turbo (16K) | $0.0030 | $0.0040 | $0.0068
Claude 2 (100K) | $0.0110 | $0.0327 | $0.0286
GPT 4 (8K) | $0.0300 | $0.0600 | $0.0720
GPT 4 (32K) | $0.0600 | $0.1200 | $0.1440

The references in parentheses represent the size of the context window for each model. In essence, the context window limits how large the prompt can be for each request. The higher the prompt limit, the better, but costs increase as well. You can read more about context windows in our article Are LLMs Like GPT Secure?

Prices per 1,000 tokens are relatively small. To put things in perspective, we estimated the cost to send and return information for a single document that is 2,000 tokens in length (about 1,500 words).
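The per-document column in the table follows directly from the per-token prices: roughly 2,000 prompt tokens in and 200 completion tokens back per document. A short Python sketch of that arithmetic (the Claude Instant row differs by a fraction of a cent because Anthropic's per-million prices were rounded when normalized to 1,000 tokens, so we omit it here):

```python
# Price per 1,000 tokens (prompt, completion), taken from the table above
PRICES = {
    "GPT-3.5 Turbo (4K)":  (0.0015, 0.0020),
    "GPT-3.5 Turbo (16K)": (0.0030, 0.0040),
    "Claude 2 (100K)":     (0.0110, 0.0327),
    "GPT 4 (8K)":          (0.0300, 0.0600),
    "GPT 4 (32K)":         (0.0600, 0.1200),
}

def cost_per_document(model: str, prompt_tokens: int = 2000,
                      completion_tokens: int = 200) -> float:
    """Cost to process one ~1,500-word document (2,000 tokens in, ~200 back)."""
    prompt_price, completion_price = PRICES[model]
    return (prompt_tokens / 1000) * prompt_price \
         + (completion_tokens / 1000) * completion_price

print(round(cost_per_document("GPT 4 (8K)"), 4))    # 0.072
print(round(cost_per_document("GPT 4 (32K)"), 4))   # 0.144
```

Dividing each result by the cheapest model's $0.0034 gives the multiples in the bar chart: GPT 4 (32K) comes out at roughly 42 times GPT-3.5 (4K), and GPT 4 (8K) at roughly 21 times.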

You can quickly see that the cost to use GPT 4 is much greater than for lesser models like GPT 3.5. This chart shows relative pricing as a multiple of the cheapest model: GPT 3.5 (4K):

Cost comparison of different LLMs compared to GPT 3.5, bar graph, with GPT 4 highest, Claude and GPT 3.5 smaller
Image: Merlin Search Technologies

GPT 4 (32K) costs over 40 times more than GPT 3.5. GPT 4 (8K) costs more than 20 times GPT 3.5. Do we need the power of GPT 4 for every task we might run?

That is the question we will address in the next section.

Getting Smart About How We Use LLMs

We chose to build a multi-LLM framework because we have found that lower-priced LLM models can handle certain tasks as well as the higher-priced versions, and often much more quickly. For example, we typically summarize documents based on a topic request and submit the summary to the LLM for analysis rather than the entire document. This lets us submit more summaries to the LLM for analysis at one time than would be possible if we submitted entire documents.

For summaries, we often recommend using GPT 3.5 (4K) or GPT 3.5 (16K). We then send the summaries to GPT 4 (8K or 32K) for more refined analysis. The same logic applies to using Claude Instant (100K) for document summaries and Claude 2 (100K) for reports. GPT 3.5 and Claude Instant both do an excellent job at summarizing documents; our testing doesn’t show a measurable improvement with the more expensive models. However, we prefer the stronger models, GPT 4 and Claude 2, for final reports and recommendations.

With a multi-LLM framework, our clients can mix and match LLMs and individual models to get the most cost-effective results.

Here is a slightly more complicated chart showing how this approach can save on LLM costs using different combinations of LLMs and models.

Cost of Different LLM Combinations Over 1,000 Documents 
Summary LLM | Reporting LLM | Summary Costs Per 1,000 Documents | Reporting Costs Per 1,000 Documents | Total Cost Per 1,000 Documents | Total Cost Per 100,000 Documents
GPT-3.5 Turbo (4K) | GPT-3.5 Turbo (4K) | $3.40 | $0.30 | $7 | $710
Claude Instant (100K) | Claude Instant (100K) | $4.36 | $0.33 | $9 | $905
Claude Instant (100K) | Claude 2 (100K) | $4.36 | $2.20 | $11 | $1,093
GPT-3.5 Turbo (4K) | GPT 4 (32K) | $3.40 | $12.00 | $19 | $1,880
Claude 2 (100K) | Claude 2 (100K) | $28.58 | $2.20 | $59 | $5,936
GPT 4 (32K) | GPT 4 (32K) | $144.00 | $12.00 | $300 | $30,000
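The summary and reporting columns can be reproduced with the same per-token arithmetic, assuming each full document is about 2,000 prompt tokens and each summary about 200 tokens (the summary then becomes the prompt for the reporting step). The total columns appear to fold in additional processing costs beyond these two components, so the sketch below illustrates the method rather than reconciling every figure exactly; small differences (e.g., Claude 2's $28.58) come from rounding the normalized per-million prices.

```python
# $ per 1,000 tokens, from the pricing table earlier in the article
PROMPT_PRICE = {
    "GPT-3.5 Turbo (4K)": 0.0015,
    "Claude Instant (100K)": 0.0016,
    "Claude 2 (100K)": 0.0110,
    "GPT 4 (32K)": 0.0600,
}
COMPLETION_PRICE = {
    "GPT-3.5 Turbo (4K)": 0.0020,
    "Claude Instant (100K)": 0.0055,
    "Claude 2 (100K)": 0.0327,
    "GPT 4 (32K)": 0.1200,
}

DOC_TOKENS = 2000     # assumed full-document prompt length
SUMMARY_TOKENS = 200  # assumed summary length (also the reporting prompt)

def summary_cost(model: str, n_docs: int = 1000) -> float:
    """Summarize n_docs full documents with the given model."""
    return n_docs * ((DOC_TOKENS / 1000) * PROMPT_PRICE[model]
                     + (SUMMARY_TOKENS / 1000) * COMPLETION_PRICE[model])

def reporting_cost(model: str, n_docs: int = 1000) -> float:
    """Send each document's summary to the reporting model."""
    return n_docs * (SUMMARY_TOKENS / 1000) * PROMPT_PRICE[model]

print(round(summary_cost("GPT-3.5 Turbo (4K)"), 2))  # 3.4
print(round(reporting_cost("GPT 4 (32K)"), 2))       # 12.0
```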

You can quickly see how the costs vary between different combinations of LLMs and LLM models. In this case, we have highlighted two combinations that we think represent smart choices. We urge clients to consider GPT 3.5 or Claude Instant for summary work. Why? Because they are much faster than their more powerful siblings and less costly. And the summaries don’t show the kind of improvement that justifies the added cost in most cases. Where a client wants the additional horsepower provided by GPT 4 or Claude 2, it is easy to change the setting.

Why consider Claude Instant or Claude 2 in this mix? One simple reason is the larger context window. Claude 2 offers a context window three times larger than GPT 4 (32K), which can be important for large documents. As an example, we can fit most transcripts into Claude Instant or Claude 2. They would have to be broken up to fit GPT’s smaller context window. This can make the difference in many cases. (We are thinking about checking text sizes before processing to automatically send larger documents to Claude 2 even if we use GPT for their analysis.)

Ultimately, the smart play depends on your matter and your documents. The one thing we can say for sure is that having a multi-LLM framework affords the flexibility to use the best, most cost-effective LLM for each task required. That is why we chose a different path than other vendors.

[1] There are other vendors on the market, but these represent the two LLMs we currently use. You can apply the same analysis to different LLMs you may be considering.

[2] GPT pricing is per 1,000 tokens. Anthropic prices per million tokens. For comparative purposes, we normalized pricing using 1,000 tokens for each.

Authors

  • John Tredennick

    John Tredennick (JT@Merlin.Tech) is the CEO and founder of Merlin Search Technologies, a cloud technology company that has developed Sherlock®, a revolutionary machine learning search algorithm. Prior to founding Merlin Search Technologies, Tredennick had a distinguished career as a trial lawyer and litigation partner at a national law firm. With his expertise in legal technology, he founded Catalyst in 2000, an international e-discovery search technology company that was later acquired by a large public company in 2019. Tredennick's extensive experience is evident through his authorship and editing of eight books and numerous articles on legal technology topics. He has also served as Chair of the ABA's Law Practice Management Section.

  • Dr. William Webber

    Dr. William Webber (wwebber@Merlin.Tech) is the Chief Data Scientist of Merlin Search Technologies. With a PhD in Measurement in Information Retrieval Evaluation from the University of Melbourne, Dr. Webber is a leading authority in AI and statistical measurement for information retrieval and ediscovery. He has conducted post-doctoral research at the E-Discovery Lab of the University of Maryland and has over 30 peer-reviewed scientific publications in the areas of information retrieval, statistical evaluation, and machine learning. Dr. Webber has nearly a decade of industry experience as a consulting data scientist for ediscovery software vendors, service providers, and law firms.

  • Lydia Zhigmitova

    Lydia Zhigmitova (LZhigmitova@Merlin.Tech) is Senior Prompt Engineer at Merlin Search Technologies. She received her Masters in Philology from the Pushkin State Russian Language Institute, Moscow. Her thesis title was "Metaphors in Scientific Communication and Their Role in Teaching Russian as a Foreign Language" (Роль метафоры в научном лингвистическом тексте и её место на занятиях по РКИ). Ms. Zhigmitova has a decade of experience in applied linguistics and digital content management. She works with Merlin data scientists and product engineers to develop prompt templates for our new GenAI platform, Discovery Partner™. She also assists Merlin clients in best practices for effective prompts and helps develop Merlin’s AI-driven content generation and analysis toolkit. Ms. Zhigmitova is based in Ulaanbaatar, Mongolia.
