An important aspect of search is documenting the results of each search. Documenting search results enables several important follow-on actions, listed below.
- Defensibility of search results (see the next section for details on various methods of validating systems and the actual search process).
- Communicating search methods and results both within internal legal e-discovery teams and to outside parties such as outside counsel and opposing parties.
- Monitoring and historical tracking of search progress.
- Assessment of search strategies, search technologies and specific vendor selections.
The EDRM Search XML Specification presents a formal way of documenting the search process and authenticating search results. This section recommends specific items that should be captured.
8.1. Results Overview
In order to capture the essential elements of search results, the following broad areas of search process and results need to be considered for appropriate documentation. As every matter differs in scale and expected level of reasonable diligence, the e-discovery team should review these potential metrics and documented actions.
- Overall document counts in the comprehensive ESI corpus that was searched.
- Collection substrata: ESI document counts broken down by file type, custodian, and date range.
- Loading state of the target corpus that indicates the batches or collections loaded, processed or staged.
- List of search queries, with a complete query specification that captures the search technology, search parameters, search user, and the time of search.
- For each search query, capture meta-data of search results.
- For each search result, identify the hit location within the result document – i.e., hit level results specifying where within the document a search term was found. Some systems highlight these terms for easy viewing and navigation.
- Additional overall search aspects such as the language of documents within the corpus, the regions in the document that were searched, the regions that were not searched, and the regions that could not be searched.
- Search accuracy metrics for each search and the overall matter.
- Search performance metrics in terms of items searched per unit of time and items retrieved per unit of time.
The following sections expand on various aspects of the above items. The e-discovery team should evaluate and adopt these potential documentation steps in light of the matter venue, scope and standard of care required to authenticate its search results.
8.1.1. Overall Corpus Counts
When recording search results, it is important to also note the actual size of the corpus searched. This gives a perspective on the selectivity of a search query. In general, the overall size of the corpus is best indicated by the number of documents present in the corpus.
In order to meet goals for validation of search results (as outlined in Section 9), it is important to record and report the overall corpus size. As an example, if a search processes 100 million documents and yields 10,000 documents as “hits”, the search can be considered quite effective in culling the corpus. In contrast, a search that processes 20,000 documents and yields 10,000 documents may not be as effective.
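The comparison above can be expressed as a simple ratio of hits to corpus size. The following is a minimal sketch; the function name `selectivity` is an illustrative assumption and not a term defined by the EDRM specification.

```python
def selectivity(hits: int, corpus_size: int) -> float:
    """Fraction of the corpus returned as hits (lower means stronger culling)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits / corpus_size


# The two scenarios from the text above.
broad = selectivity(10_000, 100_000_000)  # hits are 0.01% of the corpus
narrow = selectivity(10_000, 20_000)      # hits are 50% of the corpus

print(f"large corpus: {broad:.2%} of documents returned")
print(f"small corpus: {narrow:.0%} of documents returned")
```

The same hit count means very different things depending on the corpus size, which is why both numbers should be recorded together.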
If possible, the total individual or unique item count within the target corpus should be compared to the items reported during collection or processing. Technologies may interpret container items differently (examples include sub-attachments and embedded files within email). This can make such comparisons difficult, but the goal is to document the actual number of items searched and demonstrate that this count reflects all the items in the corpus minus any documented exceptions.
Additionally, the documentation of search results should include document counts that are compound complete, that is, counts that include all family members. For example, one attachment may contain a search “hit” but if the email family includes a parent email with 14 attachments, including attached emails that also have attachments, then the search result including full families would be much larger than the one recorded “hit.” It is quite helpful for the e-discovery team to know the total number of documents returned including all parents and attachments, as all of the resulting items might then need to be reviewed.
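As a rough illustration of a compound-complete count, the sketch below expands a set of hits to full families. It assumes a hypothetical `parent_of` mapping in which every document in the corpus appears as a key and top-level items map to `None`; real systems will represent family relationships differently.

```python
def expand_to_families(hit_ids, parent_of):
    """Return every document that shares a top-level parent with any hit.

    parent_of maps each document ID to its parent ID, or None for a
    top-level item (assumed structure, for illustration only).
    """
    def root(doc_id):
        # Walk up the chain of parents until a top-level item is reached.
        while parent_of.get(doc_id) is not None:
            doc_id = parent_of[doc_id]
        return doc_id

    hit_roots = {root(doc_id) for doc_id in hit_ids}
    return {doc_id for doc_id in parent_of if root(doc_id) in hit_roots}


# E1 is an email with attachments A1 and A2; A2 is itself an attached
# email with its own attachment A2a. E2 is an unrelated email.
parent_of = {"E1": None, "A1": "E1", "A2": "E1", "A2a": "A2", "E2": None}

hits = {"A2a"}
families = expand_to_families(hits, parent_of)
print(len(hits), "hit document(s),", len(families), "document(s) with families")
```

Recording both counts shows how much review volume a single hit can bring with it.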
8.1.2. Corpus Breakdown
A breakdown of the corpus into various categories is useful in analyzing and validating a search technique. The primary value is in categorizing an input corpus so that search hits within individual categories give the legal team the tools to further iterate and refine a search methodology. Often, this breakdown is referred to as corpus stratification and the categories are called strata. Some of the useful breakdowns of a corpus are listed below.
- By custodian
- By date range
- By tags (such as manual coding)
- By file/document type
- By loading state (see below)
- By file meta-data (such as document author, document title, document size)
- By language of documents
- By de-duplication status
- By previous search results
For each breakdown above, the number of search hits in each category is helpful for further analysis.
As an example, a search for a keyword may locate search hits predominantly in a certain date range. Also, in the date range where the search hits are located, the corpus may itself have a certain number of documents. Together, this information is useful in either narrowing or eliminating a certain date range from further consideration.
Another interesting breakdown is how current search results are distributed within previous search results. If you have a broad search that identified potentially relevant results, a new search that identifies more specific results within that result set is useful. Also, if the new search identifies results outside the coverage of the previous search hits, that fact can be used for further validation.
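The stratum counts and the comparison against a previous search can be sketched as follows. The document fields (`id`, `custodian`, `year`) and function names are assumptions chosen for illustration, not part of any specification.

```python
from collections import Counter


def hits_by_stratum(corpus, hit_ids, key):
    """Return {stratum: (hit_count, total_count)} for one breakdown key."""
    totals = Counter(doc[key] for doc in corpus)
    hits = Counter(doc[key] for doc in corpus if doc["id"] in hit_ids)
    return {stratum: (hits[stratum], totals[stratum]) for stratum in totals}


def overlap_with_previous(current_hits, previous_hits):
    """Count current hits inside and outside a previous result set."""
    current, previous = set(current_hits), set(previous_hits)
    return {
        "inside_previous": len(current & previous),
        "outside_previous": len(current - previous),
    }


corpus = [
    {"id": "D1", "custodian": "alice", "year": 2019},
    {"id": "D2", "custodian": "alice", "year": 2020},
    {"id": "D3", "custodian": "bob", "year": 2020},
    {"id": "D4", "custodian": "bob", "year": 2020},
]

by_year = hits_by_stratum(corpus, {"D2", "D3"}, "year")
overlap = overlap_with_previous({"D2", "D3"}, {"D1", "D2"})
print(by_year)
print(overlap)
```

Reporting hits alongside the stratum total keeps a small stratum with few hits from being mistaken for an unproductive one.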
8.1.3. Loading States and Batches
Corpus and search results breakdown by loading state is a useful technique. Quite often, ESI for a legal case is brought into the case in batches or tiers. These batches reflect multiple collection scopes or methodologies along with the importance of certain ESI relative to other ESI. By breaking down the corpus of documents by batches as well as identifying search hits for a search across these loading batches, the legal case team can establish useful conclusions that can further drive their additional searches.
Because searches will often be executed on incomplete batches or ESI collections, it is critical to record the status of the target collection so that supplemental searches can be run on subsequent batches, or so that the target corpus can be searched again if additional criteria are added.
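One way to make use of recorded loading state is to derive which batches still need a supplemental run for a given query. This is a minimal sketch; the batch names, query IDs, and the shape of the search log are hypothetical.

```python
def batches_needing_search(loaded_batches, searched_batches_by_query, query_id):
    """List loaded batches that a given query has not yet been run against."""
    already_searched = searched_batches_by_query.get(query_id, set())
    return [batch for batch in loaded_batches if batch not in already_searched]


# Batches currently loaded, in loading order (hypothetical names).
loaded = ["batch-01", "batch-02", "batch-03"]

# Log of which batches each query has been executed against.
search_log = {"Q1": {"batch-01", "batch-02"}, "Q2": {"batch-01"}}

pending_q1 = batches_needing_search(loaded, search_log, "Q1")
pending_q2 = batches_needing_search(loaded, search_log, "Q2")
print("Q1 still needs:", pending_q1)
print("Q2 still needs:", pending_q2)
```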
8.1.4. Search Query Recording
As part of search results recording, the search query that generated the results needs to be recorded. The search query needs to be recorded in a form that allows re-execution of the search. A primary purpose of this is to allow testing and validation of the repeatability of a search, along with a correlation of the search query against the results.
Query recording needs to include all parameters of the search, including the search technology, the corpus that is targeted, specific meta-data properties, document regions, language, stemming, tokenization, and other properties.
Many search systems record this information, but the e-discovery team should confirm that every aspect of the query is logged and not changed when system defaults or parameters are subsequently modified.
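A query record that captures its parameters explicitly, rather than relying on system defaults, might look like the sketch below. The field names are illustrative assumptions and do not reproduce the EDRM Search XML schema; the record is frozen so that it cannot be altered after it is logged.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QueryRecord:
    """Immutable snapshot of one executed query (hypothetical field names)."""
    query_text: str
    technology: str
    corpus_id: str
    fields_searched: tuple
    language: str
    stemming: bool
    tokenizer: str
    user: str
    executed_at: str


record = QueryRecord(
    query_text='merger AND "due diligence"',
    technology="keyword-boolean",
    corpus_id="matter-42/batch-01",
    fields_searched=("body", "subject"),
    language="en",
    stemming=True,
    tokenizer="whitespace",
    user="analyst1",
    executed_at=datetime.now(timezone.utc).isoformat(),
)

# Serialize every parameter so the query can be re-executed later.
serialized = json.dumps(asdict(record), sort_keys=True)
print(serialized)
```

Because the stemming and tokenizer settings are stored with the query, a later change to system defaults does not change what the log says was run.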
8.1.5. Overall Search Results Meta-Data
Search results meta-data breaks down the search results in a form that allows a quick review of the results. Some of the items captured are:
- Total search hits in the form of documents which were identified. In some cases, document hits may be available on unique documents (i.e., after de-duplication).
- Total number of keyword terms hit in each document (if the search technology is keyword based).
- Total number of documents in each corpus stratum.
- The time when search was initiated.
- The operator that performed the search.
- The duration the search ran/executed.
- Document counts for exceptions (those that the search could not process).
8.1.6. Document-Level Search Results
Document-level search results enable the legal case team to track down a specific document where a search hit occurred. Also, a complete document-by-document review of search results requires identification and eventual retrieval of the document in a reviewable form, again necessitating such results.
On a practical level, a document-level results report is essential for later authentication of potential evidence. Consider it part of the overall Chain-of-Custody that documents how these documents were selected and where they came from.
Document-level search results should also include an integrity check based on a hash computation of the document. Typical search systems perform this integrity check by computing an MD5 or SHA-1 message digest of the document and storing that hash value. When document hits are identified, the document ID along with the hash value and its location (i.e., a pointer to where the original document exists) are captured. The EDRM XML standard defines several other properties for each document using XML Elements.
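The integrity check described above can be sketched with Python's standard `hashlib` module. The record layout, document ID, and file location below are hypothetical; only the digest computation reflects the technique named in the text.

```python
import hashlib
import io


def digest_stream(stream, algorithm="sha1", chunk_size=65536):
    """Compute a hex digest of a binary stream, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def document_result(doc_id, location, stream):
    """Build a document-level result entry (hypothetical layout)."""
    return {"doc_id": doc_id, "location": location, "sha1": digest_stream(stream)}


# An in-memory stream stands in for the original file here.
entry = document_result(
    "DOC-0001",
    "/evidence/batch-01/DOC-0001.msg",  # hypothetical pointer to the original
    io.BytesIO(b"abc"),
)
print(entry)
```

Passing `algorithm="md5"` would produce an MD5 digest instead. Recomputing the digest later and comparing it to the stored value shows whether the document has changed since the search was run.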
In addition, document-level search results may include one or more small portions of the document (snippets) to indicate the context of potential hits within a document. These snippets enable a review team to perform a quick review of search results.
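Snippet extraction can be as simple as taking a fixed window of characters around a hit. The sketch below assumes the hit offsets are already known; the window size and the ellipsis markers are arbitrary illustrative choices.

```python
def snippet(text, start, end, context=30):
    """Return the hit at text[start:end] with surrounding context characters."""
    left = max(0, start - context)
    right = min(len(text), end + context)
    prefix = "..." if left > 0 else ""
    suffix = "..." if right < len(text) else ""
    return prefix + text[left:right] + suffix


text = "The board reviewed the merger proposal before the annual meeting."
hit_start = text.index("merger")
print(snippet(text, hit_start, hit_start + len("merger"), context=12))
```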
8.1.7. Hit-Level Results
When a search identifies a document, the search operation has scored a hit. In the case of a search query that is based on a search term, it is possible to have that term be found multiple times within the same document. It is also possible for a search query that contains multiple keywords to have one or more search hits for some number of terms. If these keywords are connected by Boolean operations, it is possible for some subset of keywords to be present, but it may not constitute a hit since the entire query may not match the document contents.
Search results at the hit level therefore capture the keyword and its potential hit position within the document. For queries that are not based on keywords (such as concept search), it may not be possible to identify a hit. However, if there is a sentence or paragraph level context that was responsible for the document being selected, that is captured using a hit. The EDRM Search XML Guide captures hits in the form of Hit Position Descriptions. These report the locations within various documents where a particular hit was found. This provides positive confirmation that a particular search hit actually exists, without revealing the complete contents of the document. Remember that any subsequent conversion or reformatting of ESI may affect or negate any positional descriptions.
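For keyword queries, hit positions can be recorded as character offsets into the extracted text. The following sketch uses a case-insensitive whole-word match; it does not reproduce the Hit Position Description format itself, and the offsets are only valid for the exact text they were computed against.

```python
import re


def hit_positions(text, keyword):
    """Return (start, end) character offsets of each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [(match.start(), match.end()) for match in pattern.finditer(text)]


text = "The merger closed. Merger terms follow."
positions = hit_positions(text, "merger")
print(positions)
```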
In addition to the document itself, the section within a document where a search hit occurs is also important to capture. Examples of keyword hit locations are:
- Track Changes
- Hidden Cells/Columns
- Document internal metadata fields
- File System meta-data fields
- Multilayer text – such as text below a PDF/TIFF image
- Image tags and other meta-data of other objects
In the case of fielded searches, the fields to which a search query was applied and the actual field values need to be recorded as well. Additionally, some searches may produce a hit within a container of other objects (such as an email and its attachments). In these cases, search results should capture both the container object references as well as the contained object references so that a document hit can be correlated with the actual document/container.
8.1.8. Keyword Occurrence Counts
Another useful aspect of search results is recording of counts of keywords that appear within the entire search results. When a search query involves multiple keywords or when one or more of the queries produces stemming, wildcard or fuzzy-based variations, a complete count of total occurrences for each keyword is useful for evaluating the value of searching using certain keywords. In some instances, the keyword counts both at an aggregate level (totaled over all the variations) as well as counts based on an individual variation level would each be helpful. If a search query is based on other related terms, the search results should capture the occurrence counts for each related term. If exclusion criteria are used to filter out known non-relevant documents, these occurrence metrics can provide a useful way to monitor for changes in language usage over time or sources.
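Counting occurrences per variation and in aggregate can be sketched as below for a trailing-wildcard term. The wildcard syntax (`negotiat*`) and the function name are assumptions for illustration; actual search systems expand stemming, wildcard and fuzzy variations in their own ways.

```python
import re
from collections import Counter


def variation_counts(documents, wildcard_term):
    """Count each variation matched by a trailing-wildcard term such as 'negotiat*'."""
    stem = wildcard_term.rstrip("*")
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    counts = Counter()
    for text in documents:
        counts.update(match.group(0).lower() for match in pattern.finditer(text))
    return counts


documents = [
    "Negotiate now; negotiations stalled.",
    "They negotiated and negotiate again.",
]

per_variation = variation_counts(documents, "negotiat*")
aggregate = sum(per_variation.values())
print(dict(per_variation), "total:", aggregate)
```

A variation that accounts for most of the aggregate count, or one that never occurs, is a candidate for refining the query.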
8.1.9. Language, Script and Encoding
Managing search results in a multi-language corpus can be tricky. A useful operation is to isolate the search results by language, allowing specific language experts to review these results. To facilitate this as well as to allow for easy query iterations, search results should be categorized by language. Additionally, some search systems may be able to categorize the language script used as well as how the characters of the language were encoded. Language encoding is a standard way to represent characters of a text document, with Unicode Encoding being one such standard.
8.1.10. Search Accuracy Measures
Search results accuracy measures are an indication of how well a particular search performs. The standard measures are Precision and Recall (described in Section 9). In many instances, precision and recall are difficult to compute before a document-by-document review is completed.
A useful method for overcoming this is to use a known corpus and evaluate the accuracy against the expected results for this corpus. An example of this would be to take proposed criteria for possible privileged documents and execute the search on a prior corpus that has already been reviewed. An alternative is to create a smaller sub-corpus through random or manual selection and use this after performing a manual review.
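Given a previously reviewed corpus, precision and recall can be computed by comparing the documents a search retrieves with the documents reviewers marked as relevant. The sketch below uses the standard definitions; the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall); a measure is None when it is undefined."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of retrieved documents that are actually relevant.
    precision = true_positives / len(retrieved) if retrieved else None
    # Recall: share of relevant documents that the search retrieved.
    recall = true_positives / len(relevant) if relevant else None
    return precision, recall


# Search output versus reviewer decisions on a known corpus.
retrieved_ids = {"D1", "D2", "D3", "D4"}
reviewed_relevant_ids = {"D2", "D3", "D5"}

precision, recall = precision_recall(retrieved_ids, reviewed_relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```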
8.1.11. Search Performance Measures
Search performance measures are based on how long each search takes to complete. The EDRM Metrics Project documents Search Performance measures under the Analysis Node. Typical measurements are expressed as the number of milliseconds taken to complete a search query.
There are no standard expected search performance measures, since each search method could involve varying levels of complexity, and the time taken to complete a search would be a function of this complexity. It is still useful to record the actual search query, the overall corpus size/counts, the number of actual search hits found, and the amount of time it took to complete.
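The items worth recording can be gathered by wrapping a search call with a timer. This is a minimal sketch: `substring_search` is a stand-in for a real search engine call, and the result field names are assumptions.

```python
import time


def timed_search(search_fn, query, corpus):
    """Run a search and record the query, corpus size, hit count and timing."""
    start = time.perf_counter()
    hits = search_fn(query, corpus)
    elapsed = time.perf_counter() - start
    return {
        "query": query,
        "corpus_size": len(corpus),
        "hit_count": len(hits),
        "elapsed_ms": elapsed * 1000.0,
        # Throughput is left undefined if the timer resolution reports zero.
        "items_per_second": len(corpus) / elapsed if elapsed > 0 else None,
    }


def substring_search(query, corpus):
    """Toy case-insensitive substring search used only for this example."""
    return [doc for doc in corpus if query.lower() in doc.lower()]


corpus = ["Merger agreement draft", "Lunch menu", "Notes on the merger"]
measurement = timed_search(substring_search, "merger", corpus)
print(measurement)
```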