[Editor’s Note: Craig Ball has penned another deep dive into thorny issues in eDiscovery search, this time to illustrate limitations in indexing, the precursor to interactive or programmatic search. EDRM is grateful to Craig for permission to republish. First published on Ball in Your Court, 12/13/2022]
I’ve long been fascinated by electronic search. I especially love delving into the arcane limitations of lexical search because, awful Grinch that I am, I get a kick out of explaining to lawyers why their hard-fought search queries and protocols are doomed to fail. But, once we work through the Seven Stages of Attorney E-Discovery Grief: Umbrage, Denial, Anger, Angry Denial, Fear, Finger Pointing, Threats and Acceptance, there’s almost always a workaround to get the job done with minimal wailing and gnashing of teeth.
Three consults today afforded three chances to chew over problematic search strategies:
- First, the ask was to search for old CAD/CAM drawings in situ on an opponent’s file servers based on words appearing on drawings.
- Another lawyer sought to run queries in M365 seeking responsive text in huge attachments.
- The last lawyer wanted me to search the contents of a third-party’s laptop for subpoenaed documents but without the machine being imaged or its contents processed before search.
Most of my readers are e-discovery professionals, so they’ll immediately snap to the reasons why each request is unlikely to work as planned. Before I delve into my concerns, let’s observe that all these requests seemed perfectly reasonable in the minds of the lawyers involved, and why not? Isn’t that how keyword and Boolean search is supposed to work? Sadly, our search reach often exceeds our grasp.
Have you got your answers to why they may fail? Let’s compare notes.
- When it comes to lexical search, CAD/CAM drawings differ markedly from Word documents and spreadsheets. Word-processed documents and spreadsheets contain text encoded as ASCII or Unicode characters. That is, text is stored as, um, text. In contrast, CAD/CAM drawings tend to be vector graphics. They store instructions describing how to draw the contents of the plans geometrically; essentially, they record how the annotations look rather than what they say. So, the text is an illustration of text, much like a JPG photograph of a road sign or a static TIFF image of a document, both inherently unsearchable for text unless paired with extracted or OCR text in ancillary load files. Bottom line: Unless the CAD/CAM drawings are subjected to effective optical character recognition before being indexed for search, lexical searches won’t “see” any text on the face of the drawings and will fail.
- M365 has a host of limits when it comes to indexing cloud content for search, and of course, if it’s not in the index, it won’t turn up in response to search. For example, M365 won’t parse and index an email attachment larger than 150MB. Mind you, few attachments will run afoul of that capacious limit, but some will. Similarly, M365 will only parse and index the first 2 million characters of any document. That means only roughly the first 600 to 1,000 pages of a document will be indexed and searchable. Here again, that will suffice for the ordinary, but may prove untenable in matters involving long documents and data compilations. There are other limits on, e.g., how deeply a search will recurse through nested and embedded content and the body text size of a message that will index. You can find a list of limits here (https://learn.microsoft.com/en-us/microsoft-365/compliance/limits-for-content-search?view=o365-worldwide#indexing-limits-for-email-messages) and a discussion of so-called “partially indexed” files here (https://learn.microsoft.com/en-us/microsoft-365/compliance/partially-indexed-items-in-content-search?view=o365-worldwide). Remember, all sorts of file types aren’t parsed or indexed at all in M365. You must tailor lexical search to the data under scrutiny. It’s part of counsel’s duty of competence to know what their search tools can and cannot do when negotiating search protocols and responding to discovery using lexical search.
- In their native environments, many documents sought in discovery live inside various container files ranging from e-mail and attachments in PST and OST mail containers to compressed Zip containers. Encrypted files may be thought of as being sealed inside an impenetrable container that won’t be searched. The upshot is that much data on a laptop or desktop machine cannot be thoroughly searched by keywords and queries by simply running searches within an operating system environment (e.g., in Windows or macOS). Accordingly, forensic examiners and e-discovery service providers collect and “process” data to make it amenable to search. Moreover, serial search of a computer’s hard drive (versus search of an index) is painfully slow, and so unreasonably expensive when charged by the hour. For more about processing ESI in discovery, here’s my 2019 primer (http://www.craigball.com/Ball_Processing_2019.pdf).
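For readers who like to see a failure with their own eyes, here is a minimal Python sketch of the last two points. It is a toy under stated assumptions, not a depiction of how M365 or any review platform works internally. The function names (`build_index`, `make_zip`, `raw_scan`, `processed_scan`) and the sample text are hypothetical. The only figure borrowed from the discussion above is the 2 million character cap, and a Zip file stands in for containers generally.

```python
import io
import re
import zipfile

# Assumption: a toy character cap; the 2-million figure is the one cited above.
INDEX_CHAR_LIMIT = 2_000_000


def build_index(text: str, limit: int = INDEX_CHAR_LIMIT) -> set:
    """Index only the first `limit` characters, as a capped indexer would."""
    return set(re.findall(r"[a-z0-9]+", text[:limit].lower()))


def make_zip(name: str, text: str) -> bytes:
    """Build an in-memory Zip holding one compressed text file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, text)
    return buffer.getvalue()


def raw_scan(blob: bytes, term: str) -> bool:
    """Naive search: look for the term in the container's raw bytes."""
    return term.encode("utf-8") in blob


def processed_scan(blob: bytes, term: str) -> bool:
    """'Processed' search: open the container and search each member's text."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return any(
            term in archive.read(member).decode("utf-8", errors="replace")
            for member in archive.namelist()
        )


# 1) A capped index never sees text past the cutoff (tiny cap for illustration).
document = "routine filler " * 10 + "falcon"
print("falcon" in build_index(document))            # whole document indexed
print("falcon" in build_index(document, limit=50))  # tail falls outside the cap

# 2) A keyword inside a compressed container is invisible to a raw byte scan.
blob = make_zip("memo.txt", "Quarterly update. " * 20 + "Project Falcon pricing memo.")
print(raw_scan(blob, "Project Falcon"))        # scanning the unopened container
print(processed_scan(blob, "Project Falcon"))  # scanning after extraction
```

Run it and the capped index and the raw scan both come up empty while their counterparts find the term. Note that the file name inside a Zip is stored uncompressed, so a term appearing only in a member's name could still surface in a raw scan; it's the contents that hide.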
In case I don’t post before Chanukah, Christmas and the New Year, have a safe and joyous holiday!