
Answers to the not-so-simple questions I wish I had asked sooner

Whether you are new to the ediscovery dataverse or an old hand, it seems as if there is always something new to learn in this strange industry we call home. Even as a 15-year veteran, I still find myself occasionally chuckling about being completely wrong about a data quirk or turn of phrase I thought I knew. To save you the embarrassment, or just to offer you a few nuggets of wisdom, here are the top 8 questions I wish I had asked sooner in my ediscovery journey.

What is really happening in the ediscovery process?
If you have found your way to this great website (edrm.net) and you are reading this blog, you have probably seen the EDRM model and the key phases of the ediscovery lifecycle. If you are still scratching your head, here is a quick and dirty breakdown of what happens in each step along the way:
Identification – Find whose data you want to look at and where it lives
Preservation – Prevent people from altering or destroying this data
Collection – Have experts use technology to copy and document the data itself and data about the data (metadata)
Processing – Turn 1s and 0s into a reviewable format and get rid of garbage data
Review – Use humans and tech to look at and categorize data by relevance and issues, get rid of more garbage
Analysis – Build your case using insights from review
Production – Share the non-garbage data relevant to the matter with interested parties

Fear not if your ediscovery matter does not neatly follow this great workflow — few do in real life. In practice, the ediscovery process may be much messier: you may have any number of things that cause you to backtrack, repeat a step, or even start the whole darn process over. This is completely normal. Just be sure to diligently document your process throughout and work with technology that is designed to adapt along with you.
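For the code-inclined, the lifecycle above can be sketched as a simple ordered workflow. The phase names come from the list above; the backtracking logic is purely illustrative bookkeeping, not part of any real ediscovery tool.

```python
# The EDRM phases as an ordered workflow, with the common real-world
# detour: newly discovered data sends you back to Identification.
EDRM_PHASES = [
    "Identification",
    "Preservation",
    "Collection",
    "Processing",
    "Review",
    "Analysis",
    "Production",
]

def next_phase(current, new_data_found=False):
    """Return the next phase, or backtrack to Identification
    if new data surfaces mid-matter (illustrative logic only)."""
    if new_data_found:
        return "Identification"
    i = EDRM_PHASES.index(current)
    return EDRM_PHASES[min(i + 1, len(EDRM_PHASES) - 1)]
```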

What does ESI actually mean?
Pretty much anyone in the ediscovery space for more than a hot minute understands that ESI is electronically stored information, and yet this question actually packs an increasingly interesting punch. In an era of collaboration tools, ephemeral (self-destructing) messaging apps, the internet of things, and now Elon Musk’s proposed Neuralink chip embedded inside a human brain, answering this question is significantly more complicated. What is potentially relevant has evolved substantially since I began my career in ediscovery. The world has fundamentally shifted and we are using a wide array of new and legacy technologies to communicate, work, and interact. This paradigm shift has brought a veritable deluge of potentially relevant information to scour.
The simple answer is that ESI for the purposes of ediscovery is a moving target that will continue to evolve with the tools we utilize to conduct business and interact with each other. The key to not drowning in this sea of data is twofold. First, talk with custodians early to understand how and via what technology they communicate and conduct business. Second, prioritize and triangulate your data sources by digging into the richest sources first and using insights from each successive data source to refine and narrow the scope of review in subsequent ones.
Atypical data will not be involved in every case and relevant data may not always be in all of the varied sources outlined, but the important first step is to discuss these ESI sources. Key categories today that should be discussed in any custodial interview or while scoping an ediscovery matter include the following:
- Electronic or hard copy documents (spreadsheets, presentations, documents, PDFs, etc.)
- Instant messaging (Jabber, Yammer, Facebook Messenger)
- Collaboration tools (Slack, Teams)
- Text/SMS
- Messaging apps (WeChat, WhatsApp)
- Social media (Facebook, Instagram, Twitter, LinkedIn, Snapchat)
- Ephemeral messaging (Signal, Wickr, Telegram)
- Internet of things (smart devices like Fitbits, smartwatches, Alexa, Nest)
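The prioritize-and-triangulate approach described earlier can be sketched in a few lines. The source names and "richness" scores below are hypothetical placeholders, and the narrowing factor is an assumption for illustration only.

```python
# Hypothetical triage of ESI sources: dig into the richest source
# first, then let insights from each pass narrow the scope of review
# in the next one. Names and scores are illustrative, not a standard.
sources = [
    {"name": "email", "richness": 9},
    {"name": "Slack", "richness": 7},
    {"name": "SMS", "richness": 4},
    {"name": "IoT logs", "richness": 2},
]

def triage_order(sources):
    """Order sources from most to least information-rich."""
    return [s["name"] for s in sorted(sources, key=lambda s: -s["richness"])]

def plan_review(sources, narrowing=0.5):
    """Each pass of insights shrinks the next source's review scope.
    The 0.5 narrowing factor is an illustrative assumption."""
    plan, scope = [], 1.0
    for name in triage_order(sources):
        plan.append((name, round(scope, 2)))
        scope *= narrowing
    return plan
```
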
What is structured data?
There are two types of data you will encounter in the course of ediscovery: structured and unstructured. The difference comes down to how each type of data is stored, generated, and interacted with. Structured data resides in complex applications and relational databases, like a financial trading system or a customer relationship management platform such as Salesforce. It lives in fields that are generally organized into rows and columns and may be part of a larger system of interacting, related parts.
Unstructured data, the more prevalent data format in ediscovery today, is everything else. All of the user-generated information — from Word documents to email, Slack conversations to cat memes — falls into the unstructured category and constitutes the bulk of relevant data in most ediscovery matters today. The consistent thread that unifies unstructured data is the lack of an overarching organizational schema or structure. Each unstructured piece of ESI can stand alone and is not dependent upon a larger system.
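A toy example makes the contrast concrete. The CRM-style record below is invented, but it shows why structured data supports field-level queries while unstructured data requires search, parsing, or human review.

```python
# Structured data: named fields, queryable like rows and columns.
# (All values here are made up for illustration.)
structured_record = {
    "account_id": 1042,
    "owner": "A. Custodian",
    "deal_value_usd": 250000,
}

# Unstructured data: free text with no overarching schema;
# each document stands alone.
unstructured_doc = (
    "Hey team, attaching the deck from Friday. "
    "Also, has anyone seen the cat meme channel lately?"
)

# Structured data answers precise field-level questions...
is_big_deal = structured_record["deal_value_usd"] > 100000
# ...while unstructured data needs search or review to interrogate.
mentions_deck = "deck" in unstructured_doc.lower()
```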

Where do I find relevant data?
Just as the question of what ESI to look at has gained complexity, so too has the inquiry about where to locate many of these new and existing data types. Initially, the transition from discovery to ediscovery simply meant shifting from filing cabinets to desktop computer files and Outlook folders. Factors like cloud computing, the proliferation of bring-your-own-device (BYOD) policies, and third-party hosted applications have complicated the process of data identification. The rapid post-COVID-19 shift to remote work has also created a Zoom boom and opened up new data sources and locations to investigate.
Data architecture differs from organization to organization, and the proliferation of third-party managed and hosted applications (messaging apps, social media, and collaboration tools, not to mention O365 and other enterprise-grade cloud systems) makes it imperative that you speak with the tech experts within an organization early and engage external experts if internal resources cannot paint a complete picture. There is no one right answer here, but places to look include, among others:
- Company-supplied computers, tablets, and mobile devices
- Third-party applications and/or cloud backups
- Company shared drives and backups
- Company-managed cloud and backup systems
- Legacy backup devices

How big is a gig of data?
As with everything in the practice of ediscovery, that depends! Factors including data compression, file type, and the format composition of a dataset can all impact the overall number of documents in a given GB. In general, if you have a dataset that is heavy on email, you can expect a larger document count per GB than if you have a dataset full of CAD files and massive Excel spreadsheets or PDFs. I tended to use an estimate of 5,500-7,500 documents per GB when crafting an estimate without any insight into the composition of the relevant dataverse.
The consensus among industry peers on this topic was a ceiling of 10,000 documents per GB (if the only in-scope data was email) and a floor as low as 3,000 (if I knew the dataset contained an extensive amount of large-file types). There is no hard-and-fast rule, so ensure that any assumptions you make before investigating the composition are validated and adjusted as necessary throughout the process.
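Those rules of thumb translate into a simple back-of-the-envelope calculation. The rates below mirror the estimates discussed above; treat them as starting assumptions to validate against the actual dataset.

```python
# Rough documents-per-GB rates from the rules of thumb above.
DOCS_PER_GB = {
    "unknown_low": 5500,    # blind estimate, low end
    "unknown_high": 7500,   # blind estimate, high end
    "email_heavy": 10000,   # ceiling when only email is in scope
    "large_files": 3000,    # CAD files, huge spreadsheets, PDFs
}

def estimate_documents(gigabytes, profile="unknown_low"):
    """Estimated document count for a dataset of the given size."""
    return int(gigabytes * DOCS_PER_GB[profile])
```

For example, a 10 GB dataset of unknown composition pencils out to roughly 55,000 documents on the low end, while the same 10 GB of large-file types might hold closer to 30,000.
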
What is a custodian?
This is not another name for the janitor from the popular TV show “Scrubs”; rather, it is a blanket term for the person who has administrative control over a document or piece of ESI. In the case of email, this would be the owner of the mailbox; in social media and collaboration tools, the account owner; and with SMS, the person to whom the phone number belongs. Generally, ediscovery practitioners use custodians to identify relevant data sources and prioritize ediscovery for a matter.

What is the difference between cloud and on-premise?
To start with, cloud computing is much less scary and foreign than most people would guess. Despite all the hype and confusion, cloud computing is, at its core, a very simple resource-sharing model. It is not, as some regulators might have you believe, a series of interconnected tubes or some form of dark magic. The cloud is simply on-demand computing resources, generally storage and computation power. Or even more simply, cloud computing is using someone else’s servers. In terms of interacting with the technology, the experience will not differ greatly between a cloud-hosted technology and one hosted on premise in an organization’s data center.
Where there IS a difference between cloud and on-premise ediscovery tools is in the scale, speed, and type of analytics they can offer. Legacy on-premise solutions are limited by the amount of hardware an organization has. In my past life at Gibson Dunn, if I wanted to take on a case that surpassed my current capacity, I had to buy new servers and expand my licenses. The process could be time-consuming and costly. Additionally, the advanced analytics of some of the next-gen technology on the market require so much computing power that a traditional server architecture simply could not support them without breaking the bank. So, if you have a massive matter, need really robust analytics, or have limited internal resources, cloud solutions may work better.

When should you use TAR or AI?
Unlike some of the prior questions, I can offer a definite answer in the case of using AI. Which cases should you use AI on? All of them! When technology companies first began introducing AI into legal technology, the process was both complex and limited in functionality. Practitioners using an AI workflow needed linguists and statisticians to conduct rounds of review to develop a seed set to begin teaching an algorithm. Even with the right people, only certain types of cases had enough of the right types of data to train an algorithm on.
Thankfully, AI in next-gen ediscovery tools today is better, faster, and smarter than its predecessors. With some platforms, you can opt into an AI workflow by simply switching the AI button on. Innovations in machine learning (like fastText, developed at Facebook, and Elasticsearch) allow the algorithm to learn more about the data and begin making recommendations that accelerate review after dozens of reviewer decisions instead of tens of thousands. The net result is that cases large or small can gain meaningful insights into the data quickly and, through prioritization, dramatically increase their speed of review. With a price tag of free in newer tech, not using AI in every case is a missed opportunity!
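To make the prioritization idea concrete, here is a deliberately minimal sketch. Real TAR and AI tools rely on trained models rather than keyword counts; this stand-in scorer only illustrates how surfacing likely-relevant documents first accelerates review. All document text and seed terms below are made up.

```python
# Minimal sketch of prioritized review: score each document against
# terms drawn from a handful of reviewer decisions, then put the
# highest-scoring documents at the front of the review queue.
def score(doc, seed_terms):
    """Count how many seed terms appear in the document."""
    words = set(doc.lower().split())
    return sum(1 for t in seed_terms if t in words)

def prioritize(docs, seed_terms):
    """Order documents so likely-relevant ones are reviewed first."""
    return sorted(docs, key=lambda d: -score(d, seed_terms))

docs = [
    "quarterly cat meme roundup",
    "merger agreement draft attached",
    "notes on the merger timeline",
]
queue = prioritize(docs, seed_terms={"merger", "agreement"})
```

Even this toy version shows the payoff: the garbage drifts to the back of the queue, so reviewers spend their first hours on the documents most likely to matter.
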
But wait, there’s more
After chatting with a dozen or so industry peers for this blog, it became apparent that while ediscovery appears at first to be easily understood, it is surprisingly complex to master! To that end, I have another handful of seemingly straightforward yet deceptively complex ediscovery questions for the next installment of “questions we are too afraid to ask about ediscovery.” Until next time, stay curious my friends!