Search in the EDRM context can mean employing any of a number of techniques across a variety of data. Usually the objects being searched are documents in a case, but even there the documents can take several forms. The proliferation of computer files (known as Electronically Stored Information, or ESI) has created the need for larger computer systems to store and manage the data in them. These files include common ones such as documents created with Microsoft Word© or PowerPoint©, email stored as individual message files or together in an Outlook or Notes data file, OCR files created from scanned paper documents, and more exotic files such as those created by a CAD/CAM program. The text within a document is not the only data that can be searched. Information about the documents themselves, known as metadata, can also be stored (usually in a database) and searched. This section discusses known search techniques used by existing computer systems to search the data available.
Search tools and methodologies have numerous applications during the e-discovery phase of the litigation lifecycle. The following provides a real-life example of the processes and challenges related to using search and how these challenges can be mitigated.
Attorney James Smith is working on a new case involving a motorcycle accident. The plaintiff is claiming that his local garage failed to spot the leaking brake fluid from his BMW 1150 motorcycle when he was at the garage for maintenance, which caused a mechanical failure leading to the accident. Attorney Smith, who is representing the defendant garage, has a database containing thousands of documents, including email to and from the plaintiff and the defendant, email from a mailing list for motorcycle aficionados that both plaintiff and defendant participated in, and OCR’d documents including maintenance records and receipts from the garage.
Attorney Smith wishes to find the responsive documents as quickly as possible, without having to read each of the thousands of documents in the database. In addition, he wants to be sure he does not fail to mark any privileged documents appropriately. The database has a search feature, so he decides to use it. Being familiar with Google, he enters the following search terms:
motorcycle maintenance records
This type of search, using a list of key words, is known as a keyword search. The set of documents that are returned all contain at least one of the words in his list of search terms. This is his result set.
The search engine that Attorney Smith is using highlights the words that matched his search when he looks at the documents. He notices that the results contain maintenance records for automobiles as well as motorcycles, and in a couple of cases contained email about music, not maintenance, records. He changes his search terms to:
motorcycle AND “maintenance records”
Putting in the quotation marks means that his result set will only contain matches for the entire phrase contained within the quotation marks, not for the individual words. When he adds the word AND to his search query, he creates what is known as a Boolean search. A Boolean search contains the operators AND, OR, or NOT, which tell the search engine more about what you mean when you enter your terms. In Attorney Smith’s case, this means his result set will contain documents that have both the word motorcycle and the phrase “maintenance records”.
The documents in this result set have maintenance records for all sorts of different kinds of motorcycles, so Attorney Smith changes his search terms to be more specific. He also limits his search so that the search will only run against the current subset of the documents. This is called a subset search. This time he enters:
“BMW 1150” AND brakes
Attorney Smith realizes from looking at previous documents that entering just “brakes” won’t cover enough options, so he turns on the stemming option and runs his search again. Having stemming turned on means that not only documents containing the word brakes will match his search, but also documents containing brake, braking, and braked.
Attorney Smith is pleased with the results he obtains from this search, but he soon realizes that he doesn’t see any of the actual maintenance records from his client’s garage. These were paper copies, so they were first scanned, then run through an OCR process to convert the text into a machine-readable form, before being loaded into the database for searching. The problem with OCR text taken from a scanned document is that sometimes the OCR misses a letter or two, so searches might not match.
Attorney Smith runs his search again, this time turning on the fuzzy search option. Fuzzy search allows a search term to match other terms that don’t match exactly, but might be off by a letter or two. This way, if the word brakes was misread as brokes or even blokes, it would still match the search term.
Finally, Attorney Smith has a collection of documents that contains what he needs, and he is ready to prepare review sets for his team to begin reviewing. However, he decides to run one more type of search just to be sure. This search is called a concept search, and it uses statistics to match not just exact keywords but to match concepts that are similar to the keywords entered. Concept search can also be used without keywords, by simply finding all the major concepts and how they are clustered within the set of documents.
This time Attorney Smith runs a concept search using the keywords BMW 1150, brakes, accident, and maintenance. As he scrolls through the results he doesn’t see anything new, until he sees the word stoppies, which he is unfamiliar with. A little digging in the result set of documents lets him discover that stoppies is a behavior similar to wheelies that can result in damaged brakes. The documents containing this word reveal that the plaintiff frequently engaged in this dangerous behavior. Attorney Smith now has the ammunition he needs to win his case, using a concept he did not know existed in advance.
6.1. Keyword Search
A keyword search is a basic search technique that involves searching for one or more words within a collection of documents. Typically, a keyword search involves a user typing their search request, or query, into a search engine such as Google, which then returns only those documents that contain the search terms entered. The documents returned by the search engine are called the search results.
6.1.1. Guidelines
Keyword searches are most often used to identify documents that are either responsive or privileged. They are also widely used for large-scale culling and filtering of documents. Keywords often form a basic building block for constructing other, more complex compound searches. Such compound searches use other search elements such as Boolean logic.
6.1.2. Normal Parameters
Keyword search normal parameters are:
- The syntax in the search string;
- Use of the keywords with or without stemming;
- Use of keywords with certain wildcard specifications and the syntax for said wildcards;
- Case-sensitivity of keywords used in searches and whether the keyword should match both cases;
- The target data sources to be searched;
- Whether the query can be applied to any specific fields such as email ‘To/From’ or ‘Subject’; and
- Whether the query can be applied to any specific date range such as an email ‘Sent Date’ between the date range of January 1, 2001 through December 31, 2001.
6.1.3. Assumed Parameters
The examples used in the Search Guide are all given in standard U.S. English. However, some documents may be written in non-English languages or character sets, and the following parameters may need to be specified. These are discussed in further detail in Section 7.
- The character encoding of the text – UTF-8, UTF-16, CP1252, Unicode/WideChar etc.
- Language of the keyword, to select appropriate stemming.
- Any special handling of characters such as diacritics, accents etc.
- If de-compounding of the keyword needs to be performed, usually when working with languages such as German.
- If there is a set of special characters, what the special characters are, and how an escape character is specified.
- If there is a tokenization scheme present, what the token delimiters are, and the impact of tokenization on searchability of documents. Searches may not be precise when these tokenization characters are present in the keyword.
6.1.4. Phrase Search
Keyword searches also encompass phrase searches. Phrase searches can be specified by enclosing the words in the phrase between two quotation marks (“).
The following should be considered when conducting a phrase search:
- Phrases that contain a double-quote in any of their keywords require escaping of the double-quote. Example: to search for the quoted expression:
He said: “don’t do it”
the query would be: “He said: \”don’t do it\””
- If there are noise words (see below), they will be specified in the phrase. However, the searching implementation may substitute any word in place of the noise word. Therefore, it is very important for the legal practitioner to validate the behavior of phrase searches that include noise words.
- If the phrase includes special words (such as the Boolean operator “AND”), they must be enclosed within double-quotes so they will be interpreted as part of the phrase and not as an operator.
6.1.5. Wildcard Specifications
A legal professional can specify a single character wildcard or a multiple-character wildcard in the following ways:
Wildcard type | Syntax | Description |
---|---|---|
Single-character wildcard | x?y | matches all strings that begin with substring x, end with substring y, and have exactly one-character in between x and y |
Multiple-character wildcard | x*y | matches all strings that begin with substring x, end with substring y, and have 0 or more characters between x and y |
If a keyword itself contains a wildcard character, an escaping mechanism is needed to search for that character literally. To escape, the following syntax should be used:
x\?y
For example: to search for
“How are you?”
the search string would be:
“How are you\?”
Availability of multi-character wildcards may be limited in some systems. Some search engines require a certain number of leading characters and do not support search terms that start with a wildcard.
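The wildcard rules in this section can be illustrated with a minimal sketch that translates the `?` and `*` syntax, including backslash escapes, into a regular expression. The function names `wildcard_to_regex` and `wildcard_match` are hypothetical, and the choice to match case-insensitively against a single word is an assumption, not something the guide specifies.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate the ?/* wildcard syntax into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash escapes the next character so it matches literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "*":
            parts.append(".*")   # zero or more characters
        else:
            parts.append(re.escape(ch))
        i += 1
    # Assumption: matching is case-insensitive.
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_match(pattern: str, word: str) -> bool:
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


print(wildcard_match("wom?n", "women"))   # single-character wildcard
print(wildcard_match("br*ke", "brake"))   # multiple-character wildcard
print(wildcard_match("you\\?", "you?"))   # escaped literal question mark
```

A real search engine would apply such patterns against its index rather than word by word, which is one reason leading wildcards are often restricted.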
6.1.6. Truncation Specification
Truncation specification is one way to match word variations. Truncation allows for the final few characters to be left unspecified. Truncation is specified using the following syntax:
Syntax | Description |
---|---|
x! | matches all strings that begin with substring x. |
!x | matches all strings that end with substring x. |
“x! y” | when specified within a phrase, truncation applies to the words marked with !, and the other words must match exactly. |
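As a rough illustration of the truncation table above, the following sketch checks single terms and phrases against the `x!` and `!x` forms. The helper names `truncation_match` and `phrase_match` are hypothetical, and whitespace tokenization and case-insensitive comparison are simplifying assumptions.

```python
def truncation_match(term: str, word: str) -> bool:
    """Match one query term, which may carry a leading or trailing '!'."""
    term, word = term.lower(), word.lower()
    if term.endswith("!"):
        return word.startswith(term[:-1])   # x! : word begins with x
    if term.startswith("!"):
        return word.endswith(term[1:])      # !x : word ends with x
    return word == term                     # no '!' : exact match


def phrase_match(phrase: str, text: str) -> bool:
    """Match a phrase in which only the words marked with '!' are truncated."""
    terms = phrase.split()
    words = text.split()
    for start in range(len(words) - len(terms) + 1):
        window = words[start:start + len(terms)]
        if all(truncation_match(t, w) for t, w in zip(terms, window)):
            return True
    return False


print(truncation_match("brak!", "braking"))
print(phrase_match("brak! fluid", "the brake fluid leaked"))
```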
6.1.7. Stemming Specifications
Stemming specification is another method for matching word variations. Stemming is the process of finding the root form of a word. The stemming specification will match all morphological inflections of the word, so that if you enter the search term sing, the stemming matches would include singing, sang, and song. Note that even though a stemming search will return singing for a search term of sing, this is different from wildcard search. A wildcard search for sing* will not return sang or song, while it will return Singsing.
Syntax | Description |
---|---|
x~ | matches all morphological variations (inflections) of the word. Exactly how a search implementation identifies these inflections is not specified. |
“x~ y” | when specified within a phrase, stemming variations are matched for the words marked with ~, and the other words must match exactly. |
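The difference between stemming and wildcard matching described in this section can be sketched as follows. The `INFLECTIONS` table is a hypothetical, hand-built stand-in that copies the sing example from the text; a real implementation would use a stemmer or lemmatizer, and the guide does not specify which. The wildcard side uses Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical lookup table taken from the example in the text.
# Real systems derive variations with a stemming or lemmatization algorithm.
INFLECTIONS = {
    "sing": {"sing", "singing", "sang", "song"},
}


def stem_match(term: str, word: str) -> bool:
    """Return True if word is a listed variation of term (the x~ form)."""
    variations = INFLECTIONS.get(term.lower(), {term.lower()})
    return word.lower() in variations


def glob_match(pattern: str, word: str) -> bool:
    """Return True if word matches a wildcard pattern such as sing*."""
    return fnmatchcase(word.lower(), pattern.lower())


for word in ["singing", "sang", "song", "Singsing"]:
    print(word, stem_match("sing", word), glob_match("sing*", word))
```

The loop shows that the two techniques overlap on singing but diverge on sang, song, and Singsing.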
6.1.8. Fuzzy Search
Fuzzy search allows searching for word variations such as in the case of misspellings. Typically, such searching includes some form of distance and score computations between the specified word and the words in the corpus.
Fuzzy search is specified using the operator: fuzzy-search.
Syntax | Description |
---|---|
fuzzy-search(x,s) | For the search word x, find fuzzy variations that are within the score s. The score is specified as a value from 0.0 to 1.0, with values closer to 1.0 being a closer match. The word itself, if present, will match with a score of 1.0. |
fuzzy-search(x,s,n) | For the search word x, find fuzzy variations that are within the score s and limit the results to the top n by score. |
Fuzzy search may be combined with other search constructs, and be included as part of a phrase or Boolean constructs.
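A minimal sketch of the fuzzy-search(x,s,n) operator follows. It assumes the similarity ratio from Python's standard `difflib.SequenceMatcher` as the score, which already falls between 0.0 and 1.0; the guide does not prescribe a scoring method, and real products may use edit distance or other measures. The function name `fuzzy_search` and the sample vocabulary are illustrative only.

```python
from difflib import SequenceMatcher


def fuzzy_search(term, vocabulary, min_score, top_n=None):
    """Return (word, score) pairs scoring at least min_score, best first."""
    scored = []
    for word in vocabulary:
        # ratio() is 1.0 for identical strings and 0.0 for no overlap.
        score = SequenceMatcher(None, term.lower(), word.lower()).ratio()
        if score >= min_score:
            scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


# Words as they might appear in OCR text (illustrative).
vocabulary = ["brakes", "brokes", "blokes", "clutch"]
for word, score in fuzzy_search("brakes", vocabulary, 0.6):
    print(f"{word}: {score:.2f}")
```

Lowering the score threshold widens the net, which helps with OCR errors but also admits more false matches.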
6.1.9. Errors to Avoid
Some caveats when using keyword searches are:
- Stemming may cause additional unintended keyword matches.
- Wildcard expansions may cause results to be overly broad.
- If tokenization is based on certain text characters being interpreted as delimiters, they may not be searchable as a keyword. Consider using a phrase as a search.
- Case-sensitivity may need to be considered carefully.
- If a word in a document contains a hyphen and the keyword matches any or all of the hyphenated word, depending on how the document is indexed the hyphen may prevent a match. For example, if the keyword is known and the document contains well-known, there is a chance that the search engine will not recognize the two as a match.
- If the document is structured as a compound document (i.e., has multiple sections such as Title, Body etc.), keyword-based searches should be performed with care.
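The hyphenation caveat above can be made concrete with a small sketch showing how the choice of token delimiters changes what a keyword can match. The `tokenize` function and its delimiter parameter are hypothetical; actual indexing engines have their own tokenization rules.

```python
import re


def tokenize(text: str, delimiters: str = "") -> list:
    """Split text on whitespace plus any extra delimiter characters."""
    pattern = "[" + re.escape(delimiters) + r"\s]+"
    return [token for token in re.split(pattern, text.lower()) if token]


text = "A well-known issue"

# Hyphen kept inside the token: the keyword "known" is not indexed on its own.
print(tokenize(text))
# Hyphen treated as a delimiter: "well" and "known" become separate tokens.
print(tokenize(text, "-"))
```

Whether a search for known finds this document therefore depends on how the index was built, not only on the query.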
6.2. Boolean Search
Boolean searches are used to combine the results of multiple searches as well as to express alternatives, as when you search for two or more terms but do not necessarily need all of them to be present. They are specified using Boolean operators as shown below:
Operator | Description |
---|---|
AND | This is specified between two keywords and/or phrases, and specifies that both of the items be present for the expression to match. |
OR | This is specified between two keywords and/or phrases, and specifies that either of the two items be present for the expression to match. |
NOT | Negates the truth value of the expression specified after the “NOT” operator. |
NOT w/n | Specifies that the terms and/or phrases to the right of the w/n specification must not be present within the specified number of words. |
ANDANY | This is specified between two keywords and/or phrases, and specifies that items following the “ANDANY” operator are optional. |
w/n | Connects keywords and/or phrases by using a nearness or proximity specification. The specification states that the two words and/or phrases are within n words of each other, and the two words/phrases can be in either order. NOTE: the specified number of words implies that there are n-1 intervening other words between the two. “Noise words” are counted in the specification. |
pre/n | Connects keywords and/or phrases by using a nearness or proximity specification. The specification states that the two words and/or phrases are within n words of each other, and the order of the words is important. |
w/para | The two keywords and/or phrases are found within the same paragraph, and order is not important. |
pre/para | The two keywords and/or phrases are found within the same paragraph, and order is important. |
w/sent | The two keywords and/or phrases are found within the same sentence, and order is not important. |
pre/sent | The two keywords and/or phrases are found within the same sentence, and order is important. |
start/n | The keyword/phrase is present at the start of the document or section, within n words of the start. |
end/n | The keyword/phrase is present at the end of the document or section, within n words of the end. |
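The proximity operators w/n and pre/n from the table above can be sketched over a list of tokens. This sketch assumes that "within n words" means the token positions differ by at most n (so at most n-1 intervening words), that noise words are counted, and that each term is a single word; the function names are hypothetical.

```python
def positions(tokens, term):
    """Return every index at which term occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == term]


def within(tokens, first, second, n, ordered=False):
    """w/n when ordered is False, pre/n when ordered is True."""
    for i in positions(tokens, first):
        for j in positions(tokens, second):
            gap = j - i
            if ordered and 0 < gap <= n:
                return True          # first precedes second within n words
            if not ordered and 0 < abs(gap) <= n:
                return True          # either order within n words
    return False


tokens = "the brake fluid was leaking from the front caliper".split()
print(within(tokens, "brake", "leaking", 3))                 # brake w/3 leaking
print(within(tokens, "leaking", "brake", 3, ordered=True))   # leaking pre/3 brake
```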
6.2.1. Errors to avoid
There are several issues to consider when using Boolean Search.
6.2.1.1. Evaluation Ambiguity
Ambiguous evaluation of operators occurs when the operators are specified without understanding the order of evaluation. For example,
owners AND dogs OR cats
If the intent is to find documents containing pet-owners and either dogs or cats, the above search string could produce inaccurate or unexpected results. To avoid this, use grouping operators, that is, use parentheses to enclose search terms that should be evaluated together:
owners AND (dogs OR cats)
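The effect of grouping can be demonstrated with a short sketch. It relies on Python's own rule that `and` binds more tightly than `or`, which mirrors a common convention, but a given search engine may evaluate ungrouped operators differently. The sample documents and function names are illustrative.

```python
def contains(doc: str, term: str) -> bool:
    """Return True if term appears as a whole word in the document."""
    return term in doc.lower().split()


docs = {
    "d1": "owners of dogs",
    "d2": "cats sleeping",
    "d3": "owners of cats",
}


def ungrouped(doc):
    # Python reads this as (owners AND dogs) OR cats.
    return contains(doc, "owners") and contains(doc, "dogs") or contains(doc, "cats")


def grouped(doc):
    # Parentheses force: owners AND (dogs OR cats).
    return contains(doc, "owners") and (contains(doc, "dogs") or contains(doc, "cats"))


ungrouped_hits = sorted(k for k, text in docs.items() if ungrouped(text))
grouped_hits = sorted(k for k, text in docs.items() if grouped(text))
print(ungrouped_hits)
print(grouped_hits)
```

Document d2 mentions cats but no owners, so it appears only in the ungrouped result.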
6.2.1.2. Effect of Document Segments
In some situations, a document is split into multiple segments (such as Abstract, Body, Title, References, Citation, etc.), and the Boolean operators may be limited to a specific document segment. You may then need to specify the search scope within the document.
6.2.1.3. Evaluation Order
Although the evaluation order should be immaterial, you may find that some search engines produce different results if the order is specified differently. As an example, “cats AND dogs” should produce the same results as “dogs AND cats”. In other implementations, the performance of the search is affected by the order of specification. As an example, “owners AND (cats OR dogs)” may perform better (produce search results faster) than “(cats OR dogs) AND owners”.
6.2.1.4. Boolean Operators as Keywords
When Boolean Search operators are themselves searched either as a keyword or as part of a phrase, care should be taken to avoid them being interpreted as operators. To specify keywords that would otherwise be interpreted as a Boolean Search Operator, the keywords can be enclosed within double quotes.
Example: to search whether a document contains the string w/5 within it, specify the query by “w/5”.
6.3. Grouping
Grouping is used to specify the precise order of evaluation of Boolean Search Constructs. This is achieved using parenthesized constructs as shown below:
Syntax | Description |
---|---|
((A OR B) AND (B OR C)) | Grouping by parenthesis allows individual expressions to be evaluated per the parenthesis. |
The only grouping characters are ‘(‘ and ‘)’. These are meta-characters and must be escaped if they should be searched for. As an example, to search for the phrase: Contract (Sales Department)
specify: Contract \(Sales Department\)
6.4. Synonym Search
Synonym search matches words that are determined to be synonyms of the word being searched. Such searching includes some form of dictionary or thesaurus-based lookup.
Synonym search is specified using the operator: synonym-search.
Syntax | Description |
---|---|
synonym-search(x) | For the search word x, find synonym variations. |
Synonym search may be combined with other search constructs. Synonym search may also be included as part of a phrase or Boolean construct.
6.5. Related Words Search
Related words search allows a legal professional to specify a word and other words that are deemed to be related to it. Typically, such related words are determined as either part of concept search or by statistical co-occurrence with other words.
Related word search is specified using the operator related-word-search.
Syntax | Description |
---|---|
related-word-search(x) | For the search word x, find other related words. |
Related words search may be combined with other search constructs, and be included as part of a phrase, or Boolean constructs.
6.6. Concept Search
Concept search allows a legal professional to specify a concept and have documents that describe that concept returned as the search results. It can be a useful technique to identify potentially relevant documents when a set of keywords is not known in advance. Concept search solutions rely on sophisticated algorithms to evaluate whether a certain set of documents matches a concept. There are three broad categories of concept search that a legal practitioner may need to understand in order to evaluate their applicability.
6.6.1. Latent Semantic Indexing
Latent semantic indexing (sometimes also referred to as Latent Semantic Analysis[1. S. Deerwester, Susan Dumais, G.W. Furnas, T.K. Landauer, R. Harshman (1990), “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science 41 (6): 391–407.]) is a technology that analyzes co-occurrence of keyword terms in the document collection. In textual documents, keywords exhibit polysemy (which refers to a single keyword having multiple meanings) as well as synonymy (which refers to multiple words having the same meaning). An additional factor is that certain keywords are related to the concept in that they appear together. These relationships can be an “is-a” relationship such as “motorcycle is a vehicle” or a containment relationship such as “wheels of a motorcycle”.
In the “motorcycle” example above, documents may contain helmets, safety and brakes but not the word motorcycle. Additionally, the same document may also contain references to insurance, accident, the rider’s name and a geographical location. Intuitively, a reader of the document may relate to this as a document on motorcycles, based on certain relevant terms while ignoring the presence of other irrelevant words. Latent semantic indexing understands the co-occurrence of these words, while reducing/eliminating the impact of other unrelated words.
6.6.2. Text Clustering
Text clustering is a technology that analyzes a document collection and organizes the documents into clusters.[2. For a review of various text clustering approaches, see Nicholas O. Andrews and Edward A. Fox, Recent Developments in Document Clustering, October 16, 2007, Department of Computer Science, Blacksburg, VA.] This clustering is usually based on finding documents that are similar to each other based on words contained within them (such as noun phrases). Text clustering establishes a notion of “distance between documents” and attempts to select enough documents into the cluster so as to minimize the overall pair-wise distance among all pairs of documents. In the process, new clusters are created from documents that may not belong to an existing cluster.
6.6.3. Bayesian Classifier
A Bayesian classifier[3. Naive Bayes Classifier and its use in Document Classification, http://en.wikipedia.org/wiki/Naive_Bayes_classifier.] identifies concepts using a set of representative documents in a particular category. As an example, one may select a small sample of responsive documents and feed them to a Bayesian classifier. The classifier then has the ability to discern other responsive documents in the larger collection and place them in a category. Typically, a category is represented by a collection of words and their frequency of occurrence within the documents. The probability that a document belongs to a category is based on the product, over each word of the document, of the probability of that word appearing in that category across all documents. Thus, the learning classifier is able to take the words present in a sample category and apply that knowledge to other new documents. In the e-discovery context, such a classifier can quickly place documents into confidential, privileged, responsive, and other well-known categories.
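The word-probability product described above can be sketched as a small multinomial Naive Bayes classifier. This is a simplified illustration, not the method of any particular e-discovery product: it assumes whitespace tokenization, add-one (Laplace) smoothing so unseen words do not zero out the product, and sums of logarithms in place of a raw product to avoid underflow. The class name, category labels, and training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of samples
        self.vocabulary = set()

    def train(self, text: str, category: str) -> None:
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text: str) -> str:
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category in self.doc_counts:
            # Prior: share of training samples in this category.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                # Add-one smoothing (an assumption of this sketch).
                count = self.word_counts[category][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best, best_score = category, score
        return best


classifier = NaiveBayes()
classifier.train("brake fluid leak maintenance record", "responsive")
classifier.train("brake pads replaced invoice", "responsive")
classifier.train("lunch menu friday", "not_responsive")
classifier.train("band practice music records", "not_responsive")
print(classifier.classify("brake maintenance invoice"))
```

With so few samples the result is only illustrative; in practice the quality of the sample set largely determines how well the categories generalize.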
6.6.4. Concept Search Specification
The effectiveness of concept search in an e-discovery project depends greatly on the type of algorithm used and its implementation. Given the multiple different technologies, the EDRM Search specification proposes that a search request record that a concept search was used to fulfill it, which registered concept-search implementation/algorithm was used, and an identifier (name) of the concept that was used in the search.
Concept search is specified using the operator concept-search.
Syntax | Description |
---|---|
concept-search(concept-implementation, x, vendor-param-1, vendor-param-2, …) | Given a concept x and concept-implementation, locate all documents that belong or describe that concept. Some vendor implementations may require additional parameters. |
To indicate the type of concept-implementation, concept search vendors are encouraged to register their implementation name. It is not required to disclose the internal algorithms the vendor utilizes to implement the search. Concept search may be combined with other search constructs, and also be included as part of a phrase or Boolean clause.
6.7. Occurrence Count
Occurrence count search allows a legal professional to specify that a word appear a certain number of times for the document to be selected.
Occurrence count search is specified using the operator occurs.
Syntax | Description |
---|---|
occurs(x,n) | For the search word x, count the number of times it appears, and select the document if the specified occurrence count is matched. |
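A minimal sketch of the occurs(x,n) operator follows. The guide does not say whether the count must be reached or matched exactly, so this sketch treats n as a minimum by default and exposes a hypothetical `exact` flag for the stricter reading. Whole-word, case-insensitive counting is also an assumption.

```python
import re


def occurs(text: str, term: str, n: int, exact: bool = False) -> bool:
    """Select a document based on how many times term appears in it."""
    pattern = r"\b" + re.escape(term) + r"\b"
    count = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return count == n if exact else count >= n


text = "Brake fluid leaked. The brake line and brake pads were checked."
print(occurs(text, "brake", 3))
print(occurs(text, "brake", 2, exact=True))
```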
6.7.1. Diacritic Specification
For languages that include diacritic characters on certain characters (such as vowels), specifying whether the diacritics should match is a search option.
Syntax | Description |
---|---|
diacritic-sensitive(x) | All the characters must match in their specific diacritic marks. |
diacritic-insensitive(x) | Diacritics are not considered in evaluating matches. (This is the default) |
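One way to approximate the two diacritic options is with Unicode normalization from Python's standard `unicodedata` module. In this sketch the insensitive comparison decomposes each string (NFD) and drops the combining marks, while the sensitive comparison normalizes to composed form (NFC) so that equivalent encodings of the same accented letter still match. The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_match(term: str, word: str, sensitive: bool = False) -> bool:
    if sensitive:
        # Compare composed forms so the diacritics must agree.
        return unicodedata.normalize("NFC", term) == unicodedata.normalize("NFC", word)
    return strip_diacritics(term) == strip_diacritics(word)


accented = "r\u00e9sum\u00e9"  # "resume" written with two acute accents
print(diacritic_match("resume", accented))                  # insensitive (default)
print(diacritic_match("resume", accented, sensitive=True))  # sensitive
```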
6.8. Searching for Parameters
Parameterized search allows searching to be based not on keywords but on certain parameters, such as a document’s metadata. Parameterized search is also known as fielded search, because it is frequently performed on data stored within the fields of a database table.
6.8.1. Searching within a Date Range
Date range search allows a legal professional to search a document’s metadata to find search results where the creation dates, access dates, or modification dates of documents fall within a specified range of dates. Email usually has associated sent and received dates, and electronic documents have metadata for the dates they were created, last accessed, and so on. The list of metadata for dates can be found in Appendix 1.
Sometimes there is no information for some of the date fields available for a document, such as when only a creation date exists but no date information is available for when the document was last modified. Date range searches can be open-ended, where, for example, the search can be for all documents created before or after the date given in the search term. Date ranges can be specified using the following syntax:
Syntax | Description |
---|---|
= date-string | The exact date is matched |
> date-string | The document date is greater than (i.e., after) the date in the date-string |
< date-string | The document date is less than (i.e., before) the date in the date-string |
Date-string is specified using ISO 8601 standards, which is also adopted by the W3C organization – (http://www.w3.org/TR/NOTE-datetime) in the following way: YYYY-MM-DDThh:mm:ssTZD where:
- YYYY = four-digit year
- MM = two-digit month (01=January, etc.)
- DD = two-digit day of month (01 through 31)
- hh = two digits of hour (00 through 23) (am/pm NOT allowed)
- mm = two digits of minute (00 through 59)
- ss = two digits of second (00 through 59)
- s = one or more digits representing a decimal fraction of a second
- TZD = time zone designator (Z or +hh:mm or -hh:mm)
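The date comparisons can be sketched with Python's standard `datetime` module, using `=` for an exact match, `>` for after, and `<` for before. The helper names are hypothetical. The sketch assumes both dates carry a time zone designator, because Python refuses to compare a zone-aware value with a zone-less one, and it rewrites a trailing Z because older Python versions of `fromisoformat` do not accept it.

```python
from datetime import datetime


def parse_date_string(value: str) -> datetime:
    """Parse a YYYY-MM-DDThh:mm:ssTZD string into a datetime."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"   # Z means UTC
    return datetime.fromisoformat(value)


def date_matches(document_date: str, operator: str, date_string: str) -> bool:
    """Apply one of the =, >, < comparisons to a document date."""
    doc = parse_date_string(document_date)
    ref = parse_date_string(date_string)
    if operator == "=":
        return doc == ref
    if operator == ">":
        return doc > ref    # document date is after the reference
    if operator == "<":
        return doc < ref    # document date is before the reference
    raise ValueError(f"unsupported operator: {operator}")


sent = "2001-06-15T09:30:00-05:00"
in_range = (date_matches(sent, ">", "2001-01-01T00:00:00Z")
            and date_matches(sent, "<", "2001-12-31T23:59:59Z"))
print(in_range)
```

Because the values are compared as instants, two strings with different time zone offsets can still be equal.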
6.8.2. Searching for Metadata
Metadata search allows searching to be constrained based on certain metadata elements of a document. A general search specification allows for naming the metadata fields, specifying the inherent type of that metadata, and the value to search for. Metadata is specified using the operator metadata-search.
Syntax | Description |
---|---|
metadata-search(Name, Type, Start-value, End-value) | Metadata search specifies the name of the metadata using the Name attribute. The Type specifies the type of the values (see section of Type Specification). Also, to specify the matching criteria, the specification includes a Start-Value and an End-Value. A range specification is useful since metadata is typically a numeric valued item. |
For a list of named metadata properties see Appendix 1.
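A rough sketch of metadata-search(Name, Type, Start-value, End-value) over documents held as dictionaries follows. The field names `FileSize` and `Custodian` are illustrative and not taken from Appendix 1, the Type is represented here by a Python callable such as `int` or `str`, and treating the range as inclusive at both ends is an assumption.

```python
def metadata_search(documents, name, value_type, start_value, end_value):
    """Return documents whose named metadata value lies in [start, end]."""
    start, end = value_type(start_value), value_type(end_value)
    hits = []
    for doc in documents:
        if name not in doc:
            continue                      # field missing: cannot match
        value = value_type(doc[name])
        if start <= value <= end:
            hits.append(doc)
    return hits


documents = [
    {"id": 1, "Custodian": "Smith", "FileSize": "1200"},
    {"id": 2, "Custodian": "Jones", "FileSize": "98000"},
    {"id": 3, "Custodian": "Smith"},
]

# Numeric range on a size field.
print([d["id"] for d in metadata_search(documents, "FileSize", int, "1000", "5000")])
# A single value is a range whose start and end are the same.
print([d["id"] for d in metadata_search(documents, "Custodian", str, "Smith", "Smith")])
```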
6.8.3. Searching for Custodian
Custodian search is a common form of constraining search results. To search based on a custodian, the metadata search using the metadata name “Custodian” can be used. Custodian search may rely on assigning custodians to collected data during the Identification Phase so that searching doesn’t miss out on custodians. For example, instant messages with buddy-names may be missed if the search term is specified as last-name/first-name or as email addresses.
6.8.4. Searching for Tags
Tag searching for specific tags, such as a Batch Label or a Document Tag like “Responsive” or “Privilege”, allows the legal professional to filter and constrain results in useful ways. To specify these searches, the following syntax is helpful:
Syntax | Description |
---|---|
tag-search(Tag-Name, Tag-Value) | Tag-Name is any user specified string, and Tag-Value is the current value of the tag. |