AI Prompt to Improve Keyword Search

Image: Craig Ball with AI.

[EDRM Editor’s Note: The opinions and positions are those of Craig Ball. This article is republished with permission and was first published on August 4, 2024.]

Twenty years ago, I dreamed up a website where you would submit a list of eDiscovery keywords and queries and the site would critique the searches and suggest improvements to make them more efficient and effective. It would flag stop words, propose alternate spellings, and alert the user to pitfalls making searches less effective or noisy. I even envisioned it testing queries against a benign dataset to identify overly broad terms and false hits.

I believed this tool would be invaluable for helping lawyers enhance their search skills and achieve greater efficiency. Over the years, I tried to bring this idea to life, seeking proposals from offshore developers and pitching it to e-discovery software publishers as a value-add. In the end, a pipe dream. Even now, nothing like it exists.

The emergence of AI-powered Large Language Models like ChatGPT made me think what I’d hoped to bring to life years ago might finally be feasible. I wondered if I could create a prompt for ChatGPT that would achieve much of what I envisioned. So, I dedicated a sunny Sunday morning to playing “prompt engineer,” a whole cloth term for those who craft AI prompts to achieve desired outcomes.


The result was promising, a significant step forward for lawyers who struggle with search queries without understanding why some fail. Most search errors I encounter aren’t subtle. I’ve written about ways to improve lexical search, and the techniques aren’t rocket science, though they require some familiarity with how electronically stored information is indexed and how search syntaxes differ across platforms. Okay, maybe a little rocket science. But if you’re using a tool for critical tasks, shouldn’t you know what it can and cannot do?

Some believe refining keywords and queries is a waste of time, casting keyword search as obsolete. Perhaps on your planet, Klaatu, but here on Earth, lawyers continue using keywords with reckless abandon. I’m not defending that but neither will I ignore lawyers’ penchant for lexical search. Until the cost, reliability, and replicability of AI-enabled discovery improve, keywords will remain a tool for sifting through large datasets. However, we can use AI LLMs right now to enhance the performance and efficiency of shopworn approaches.

How Does It Work?

The prompt below was developed and tested on ChatGPT 4o (for Omni), a subscription product that costs $20.00/month. I ran it in the free versions, too, and it seemed to work; but my experience is with 4o, and I commend the latest version to you as twenty bucks well spent.

To use the prompt, log in to ChatGPT and copy and paste the prompt below into the chat window (don’t hit “Enter” yet) then use the paperclip button to upload a discrete list of the keywords and queries for assessment. You can upload them in plain text, rich text, or as a Word document or PDF. Now, hit “Enter.” Depending upon the length of ChatGPT’s response, you may need to click “Continue Generating” or type “continue” into the chat box to force the application to complete its response.

——- Start of Prompt to Paste (start on next line) ——–

### AI Prompt for Analyzing Keywords and Boolean Queries

Introduction
Purpose of the analysis: to enhance the efficiency and accuracy of keyword and Boolean queries in retrieving relevant documents during discovery in litigation. Highlight the need to balance recall and precision to ensure all relevant documents are identified without disproportionate noise. The analysis aims to optimize search strategies based on a comprehensive set of parameters, offering query-specific feedback as a table and general guidance as a narrative to improve recall and precision for lexical search.

Analysis Framework

1. Stop Word Identification

Objective: Determine if proposed terms are likely to be stop words in e-discovery tools, which may be ignored during indexing and search.
Approach: Review each term against common stop word lists used by popular e-discovery platforms, as follows:

Relativity (dtSearch) Default Stop Words: The default noise word list consists of punctuation marks, single letters and numbers, and the following words: a, about, after, all, also, an, and, another, any, are, as, at, be, because, been, before, being, between, both, but, by, came, can, come, could, did, do, each, even, for, from, further, furthermore, get, got, had, has, have, he, her, here, hi, him, himself, his, how, however, i, if, in, indeed, into, is, it, its, just, like, made, many, me, might, more, moreover, most, much, must, my, never, not, now, of, on, only, or, other, our, out, over, said, same, see, she, should, since, some, still, such, take, than, that, the, their, them, then, there, therefore, these, they, this, those, through, thus, to, too, under, up, very, was, way, we, well, were, what, when, where, which, while, who, will, with, would, you, your
DISCO Default Stop Words: Stop words are words that are not indexed by DISCO search and will not get hits in search results. Matters created after April 29, 2019 will index all words and no longer remove stop words from the search index. Matters created prior to April 29, 2019 do not index the following stop words: a, an, and, are, as, at, be, by, for, if, in, is, it, of, on, or, that, the, their, then, there, these, they, to, was, with
Everlaw Default Stop Words: There are no stop or noise words; Everlaw indexes all words for content searches.
dtSearch Default Stop Words: The default noise word list consists of punctuation marks, single letters and numbers, and the following words: a, about, after, all, also, an, and, another, any, are, as, at, be, because, been, before, being, between, both, but, by, came, can, come, could, did, do, each, even, for, from, further, furthermore, get, got, had, has, have, he, her, here, hi, him, himself, his, how, however, i, if, in, indeed, into, is, it, its, just, like, made, many, me, might, more, moreover, most, much, must, my, never, not, now, of, on, only, or, other, our, out, over, said, same, see, she, should, since, some, still, such, take, than, that, the, their, them, then, there, therefore, these, they, this, those, through, thus, to, too, under, up, very, was, way, we, well, were, what, when, where, which, while, who, will, with, would, you, your
Logikcull Default Stop Words: There are no stop or noise words; Logikcull indexes all words for content searches.
IPRO ZyLAB ONE Default Stop Words: and, exclude, not, number range, or, precedes, quorum, to, within. If a term or combination of terms you are searching for contains a hyphen, that term will be found, even if you did not include a hyphen in your search query. For example, when you search for ‘email’ or ‘e mail’, it will also find ‘e-mail’. However, ‘e-mail’ will only retrieve ‘e-mail’. In addition, ‘e mail’ will not find ‘email’ or the other way around (‘email’ will not find ‘e mail’). It is not possible to search for capitalized letters, since all terms in the dictionary are stored in lower case.
IBM Discovery Default Stop Words: a, about, above, after, again, am, an, and, any, are, as, at, be, because, been, before, being, below, between, both, but, by, can, did, do, does, doing, don, down, during, each, few, for, from, further, had, has, have, having, he, her, here, hers, herself, him, himself, his, how, i, im, if, in, into, is, it, its, itself, just, me, more, most, my, myself, no, nor, not, now, of, off, on, once, only, or, other, our, ours, ourselves, out, over, own, s, same, she, should, so, some, such, t, than, that, the, their, theirs, them, themselves, then, there, these, they, this, those, through, to, too, under, until, up, very, was, we, were, what, when, where, which, while, who, whom, why, will, with, you, your, yours, yourself, yourselves
Nuix Discover (formerly Ringtail) Default Stop Words: a, about, after, all, also, an, and, another, any, are, as, at, be, because, been, before, being, both, but, by, came, can, come, could, did, do, each, even, for, further, furthermore, get, got, had, has, have, he, her, here, hi, him, himself, how, however, i, if, in, indeed, into, is, it, its, just, like, made, many, me, might, more, moreover, most, much, must, my, never, not, now, of, on, only, or, other, our, out, over, said, same, see, she, should, since, some, still, such, take, than, that, the, their, them, then, there, therefore, these, they, this, those, through, thus, too, under, up, very, was, way, we, well, were, what, when, where, which, while, who, will, with, would, you, your
Exterro FTK Default Stop Words: a, able, about, across, after, ain’t, all, almost, also, am, among, an, and, any, are, aren’t, as, at, be, because, been, but, by, can, can’t, cannot, could, could’ve, couldn’t, dear, did, didn’t, do, does, doesn’t, don’t, either, else, ever, every, for, from, get, got, had, hadn’t, has, hasn’t, have, haven’t, he, her, hers, him, his, how, however, i, if, in, into, is, isn’t, it, it’s, its, just, least, let, like, likely, may, me, might, most, must, my, neither, no, nor, not, of, off, often, on, only, or, other, our, own, rather, said, say, says, she, should, shouldn’t, since, so, some, than, that, the, their, them, then, there, these, they, they’re, this, tis, to, too, twas, us, wants, was, wasn’t, we, we’re, we’ve, were, weren’t, what, when, where, which, while, who, whom, why, will, with, would, would’ve, wouldn’t, yet, you, you’d, you’ll, you’re, you’ve, your

2. Synonyms and Variants

Objective: Identify synonyms, spelling variants, British alternative spellings, related terms, common misspellings, and transpositions.
Approach: Use linguistic databases and thesauri to expand each term into potential variants that could capture relevant documents. Supply alternative spellings and common misspellings.

3. Industry-Specific Jargon and Abbreviations

Objective: Incorporate industry-specific language that might be used in relevant documents.
Approach: Consult industry glossaries and articles by experts to identify terms and abbreviations commonly used in the field (as specified or, in the absence of a specification, as may be gleaned from the context of the queries submitted here).

4. Boolean Query Structure and Logic

Objective: Evaluate the logic and structure of each Boolean query to ensure alignment with search objectives.
Approach: Analyze each query for logical consistency, correct operator usage, and alignment with intended search parameters.

5. Search Syntax and Connectors

Objective: Ensure compatibility with the syntax used by specific e-discovery tools and the proper use of connectors and parentheses for logical grouping of operators.
Approach: Adjust query syntax to match the requirements of different platforms (e.g., Relativity, OpenText Insight, DISCO, Nuix Discover, Everlaw, Logikcull).
– Identify common syntactic errors across tools, noting variations like:
– Relativity/dtSearch: “w/n”
– OpenText Insight: “NEAR/n”
– DISCO: “/n” for unordered terms, “+n” for ordered terms.

6. Wildcards and Stemming

Objective: Utilize wildcards and stemming to broaden term inclusion without sacrificing precision.
Approach: Evaluate opportunities to use wildcards or stemming effectively within each query and articulate such uses.

7. Special Characters and Indexing

Objective: Ensure that queries do not include characters that are excluded from indexing or reserved for special purposes.
Approach: Identify and remove or adapt special characters or reserved operators in queries.

8. Spaces and Punctuation

Objective: Understand how spaces and punctuation are treated in the index being searched.
Approach: Analyze the treatment of these elements within the tool’s indexing process and adjust queries accordingly.

9. Numeric Values and Short Words

Objective: Address potential indexing limitations for numeric values and short words.
Approach: Determine whether these elements are indexed and consider alternative search strategies if not.

10. Diacritical Marks

Objective: Address alternative spellings of words incorporating diacritical characters.
Approach: Evaluate whether the tool creates equivalencies for diacritical variations and adjust queries as necessary.

11. Case Sensitivity

Objective: Determine if the search tool supports different letter cases (e.g., SAT vs. sat).
Approach: Test queries for case sensitivity and adjust strategies accordingly.

**Objective:**

Evaluate the effectiveness of each keyword and Boolean query in retrieving relevant documents for litigation-related discovery. The analysis aims to optimize search strategies based on a comprehensive set of parameters, offering query-specific feedback and general guidance to improve recall and precision for lexical search.

**Instructions:**

– **Review Each Term:** Analyze each keyword against the specified Objectives and Approaches above, considering tool compatibility regarding syntax and character handling.

– **Analyze Boolean Queries:** Evaluate the structure and logic of each query for effectiveness and adherence to the tool’s syntax rules in furtherance of the specified Objectives and Approaches above.

– **Presentation of Review and Analysis:** Temperature 0. Present the results in a tabular format with each keyword/query presented as a row and each numbered Objective above addressed in a column.

– **Provide Feedback:** After individual assessments, supply a comprehensive essay with guidance on improving recall and precision in e-discovery lexical searches, incorporating insights from experts like Craig Ball^[1] (craigball.com and craigball.net) and The Sedona Conference Working Group 1, with attribution.

—EXAMPLE: ### Guidance Essay: Improving Recall and Precision in Lexical Search for eDiscovery

**1. Pre-Search Preparation:**

– Understand the dataset’s sources, types, and organization. Engage subject matter experts for relevant terminology insights.

**2. Crafting Comprehensive Keyword Lists:**

– Develop exhaustive lists with synonyms, acronyms, and industry-specific jargon. Account for linguistic variations and common misspellings.

**3. Optimizing Boolean Logic and Search Syntax:**

– Refine Boolean logic for precision. Understand tool-specific syntax, like proximity search differences.

**4. Incorporating Wildcards and Stemming:**

– Use wildcards and stemming to broaden parameters without overreach.

**5. Handling Index Exclusions and Special Characters:**

– Recognize special character treatment and indexing criteria to avoid missed documents.

**6. Addressing Diacriticals and Case Sensitivity:**

– Ensure searches accommodate diacriticals and case variations.

**7. Continuous Refinement and Documentation:**

– Iterate and document search strategies for consistency and defensibility.

These strategies enhance eDiscovery processes by improving lexical search precision and recall. As Craig Ball highlights, “The efficacy of eDiscovery lies not just in technology, but in the thoughtful application of that technology to the unique demands of each case” (Ball, 2024).

———–End of Prompt (ends on prior line)———-
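
The stop word check in item 1 of the prompt can also be approximated locally, before anything is uploaded. What follows is a minimal Python sketch and not part of the prompt; it is not how ChatGPT performs the analysis. It uses only the legacy DISCO list quoted in the prompt (matters created before April 29, 2019). The function name `flag_stop_words` and the choice to skip uppercase AND/OR/NOT as Boolean operators are assumptions made for illustration.

```python
import re

# Stop words quoted in the prompt for DISCO matters created before April 29, 2019.
DISCO_LEGACY_STOP_WORDS = frozenset(
    "a an and are as at be by for if in is it of on or that the their then "
    "there these they to was with".split()
)

# Assumption of this sketch: uppercase AND/OR/NOT are operators, not search terms.
OPERATORS = frozenset({"and", "or", "not"})


def flag_stop_words(query, stop_words=DISCO_LEGACY_STOP_WORDS):
    """Return the stop words that appear as terms in a query, in order of appearance."""
    flagged = []
    for token in re.findall(r"[A-Za-z']+", query):
        if token.isupper() and token.lower() in OPERATORS:
            continue  # skip Boolean operators written in uppercase
        word = token.lower()
        if word in stop_words and word not in flagged:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    for q in ['"breach of contract" AND termination', "fraud OR embezzle*"]:
        print(q, "->", flag_stop_words(q))
```

Swapping in another platform's list from the prompt only requires passing a different set as `stop_words`. A flagged word is a prompt to check how the platform in use actually indexes it, not proof that the query will fail.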

What Do You Think? Can You Do Better?

In my experience with AI-powered Large Language Models (LLMs), they often excel at some tasks while underperforming in others. When testing query sets, I was modestly pleased by the results and felt they could be valuable for users looking to avoid common mistakes in search formulation. However, in other trials, issues with list formatting caused ChatGPT to struggle, resulting in less-than-optimal outcomes.^[2]
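
The specific formatting problems are not described above, so the following is only a guess at a useful precaution: reducing the list to one query per line, with stray bullets, numbering, and extra whitespace removed, before uploading it. The function name `normalize_query_list` and the set of markers it strips are assumptions. Duplicates are dropped only on an exact match, because case variants (SAT vs. sat) may be deliberate.

```python
import re

# Leading bullets (-, *, bullet, en dash) or numbering such as "1." or "2)",
# but only when followed by whitespace, so a query like "-privileged" is left alone.
_LEADING_MARKER = re.compile(r"^\s*(?:[-*\u2022\u2013]+|\d+[.)])\s+")


def normalize_query_list(raw_text):
    """Return one cleaned query per entry, preserving order and dropping exact duplicates."""
    seen = set()
    cleaned = []
    for line in raw_text.splitlines():
        query = _LEADING_MARKER.sub("", line).strip()
        query = re.sub(r"\s+", " ", query)  # collapse runs of whitespace
        if not query or query in seen:
            continue
        seen.add(query)
        cleaned.append(query)
    return cleaned


if __name__ == "__main__":
    raw = "1. fraud w/5 audit\n\n\u2022 fraud w/5 audit\n- embezzl*\n  breach   of contract "
    print("\n".join(normalize_query_list(raw)))
```

Saving the result as a plain text file keeps each query on its own line for upload.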


The prompt provided here serves as a starting point for further development. Don’t hesitate to reformat the output, incorporate your own assessment criteria, or include specifics about your e-discovery platform and its unique features. I saw improved results when I uploaded pertinent additional information, such as my own Primer on Processing in E-Discovery found here. Ideally, I’d also supply the search syntax for the discovery platform used for search. Experiment with different LLMs and tailor the prompt (and uploads) to fit each case. I am confident that my readers can build upon these ideas, and I encourage you to share your findings for the benefit of the legal community.
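
To illustrate why platform syntax matters, here is a small Python sketch that rewrites a dtSearch-style proximity connector into the other forms listed in section 5 of the prompt. It relies only on the connector forms quoted there ("w/n", "NEAR/n", "/n") and assumes they are interchangeable unordered proximity operators, which should be confirmed against each platform's documentation. The names `PROXIMITY_FORMATS` and `convert_proximity` are hypothetical.

```python
import re

# Connector templates taken from the list in the prompt (section 5).
# Treating them as exact equivalents is an assumption of this sketch.
PROXIMITY_FORMATS = {
    "relativity": "w/{n}",   # Relativity/dtSearch
    "opentext": "NEAR/{n}",  # OpenText Insight
    "disco": "/{n}",         # DISCO, unordered terms
}

# Matches a standalone w/N connector, e.g. "w/5" or "W/10".
_W_N = re.compile(r"\bw/(\d+)\b", re.IGNORECASE)


def convert_proximity(query, platform):
    """Rewrite every w/N connector in a query using the target platform's form."""
    template = PROXIMITY_FORMATS[platform]
    return _W_N.sub(lambda m: template.format(n=m.group(1)), query)


if __name__ == "__main__":
    q = "(contract w/5 breach) AND termination"
    for name in PROXIMITY_FORMATS:
        print(name, "->", convert_proximity(q, name))
```

Only the connector is rewritten. Wildcards, grouping, and stop word behavior differ across platforms too and still need review by hand or by the prompt.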

Read the original release here, updated 08/07/2024.

Notes

[1] Before you conclude that I egomaniacally injected myself into the mix, I asked ChatGPT to assist with drafting and refining the prompt (because it tends to do a good job refining its own prompts) and the AI sucked me in from whatever dark recesses it explores. Feel free to take me out, coach.

[2] Oddly, the more I tweaked the parameters (or invited ChatGPT to do so), the less useful the output became. It was almost as if the LLM started to get bored with the project. Ultimately, I scrapped the more heavily refined version of the prompt in favor of an early iteration. That encapsulates my frustration with AI LLMs: they seem to reach a point at which improvement is elusive. Take, for example, the AI-generated illustration accompanying this post. No amount of prompting succeeded in cajoling the system to change “lexial” to “lexical” or generate something without robots. I’m getting pretty darn tired of all the robots.

Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.

Author

Craig Ball

Craig Ball is a Texas trial lawyer, computer forensic examiner, law professor and noted authority on electronic evidence. He limits his practice to serving as a court-appointed special master and consultant in computer forensics and electronic discovery and has served as the Special Master or testifying expert in computer forensics and electronic discovery in some of the most challenging and celebrated cases in the U.S. Craig is also EDRM’s General Counsel and a key contributor to many EDRM projects.
