7. Search Engines and Searching Considerations

Automated searching of ESI requires the specification of a search request in the form of a query. Typically, a search request is issued and the ESI is examined to find documents that match the request. Automated search solutions called search engines employ different techniques for accomplishing this. While an extensive discussion of the mechanics of search engines is beyond the scope of this guide, we provide an overview of commonly used techniques.

7.1. Search Engines

Search engines are a category of information retrieval systems designed to supply a subset of items from a population based on a specified set of criteria. The structure and function of search engines are largely determined by:

  1. The nature of the information that is the target of a search;
  2. The methods used to distinguish a subset of information from a larger population; and
  3. The interface used to conduct search queries and to display query results.

Most familiar search engines explore the dynamically changing and widely varied information accessible via the World Wide Web. In the field of e-discovery, however, the population of information is often static and determined by the scope of electronic data secured during the discovery process. As a result, search engines used in e-discovery can conduct a comprehensive examination of a population of data. Indeed, one of the main requirements of a search engine applied to e-discovery should be the ability to identify all documents and data files that are responsive to a specific set of search criteria.

7.1.1. Search

The defining features of a search engine are the methods by which information about a population of data – metadata – is collected, stored, and examined to identify a subset of interest. In the scope of e-discovery, the purpose of collecting metadata is to generate a compact and easy-to-manipulate set of information about a population of electronic data. In addition, the type of metadata collected is intended to increase the efficiency of identifying individual documents based on their unique features or categorizing documents based on common features. To support these goals, metadata can take many forms, including the type of information contained in individual data files (e.g., spreadsheet, text or images), the relationships between individual data files, information about the origin or source of the data, the dates for the creation and modification of the information within a file, or an index of keywords contained in a document.

Metadata is stored using a wide variety of methods, which can include an index of document contents, a database of document characteristics, or a veridical representation of document contents (i.e., cached files). The functional feature of a search engine is the algorithm used to query the metadata and organize the results of a query. The methods used for searching and representing search results are often proprietary and zealously guarded trade secrets of the organizations that develop and offer search services. Great effort is placed in optimizing the speed, accuracy, comprehensiveness and relevance of electronic searches based on the type of information that is of interest to the user.

The user interface is the most visible aspect of a search engine. User interfaces are responsible for both the collection of information relevant to conducting a search and the representation of search results. The type of information a search engine prompts a user to enter is often a reflection of the metadata and search methods it utilizes. For example, search engines that rely on full-text or content searches of cached and indexed data would be designed to prompt a user for unique keywords, keyword combinations/arrangements or strings of text that are found within a population of documents. In turn, the representation of the results from this type of search algorithm would place an emphasis on organizing documents according to their unique relevance to the entered search terms. In contrast, a search engine that relies on a Boolean search of a document database would prompt the user for specific search logic that discretely identifies documents. The search terms and search logic are in turn based on metadata or data fields coded for each document in a population of data. The results from this type of search would more likely emphasize a comprehensive listing of the search results, since the typical goal is to identify a subset of documents based on their common features that satisfy the specified search requirements. Overall, the most effective user interfaces are characterized by how well they balance the availability of a variety of search methods and search result representations according to the range of searches a user is likely to conduct.

In summary, search engines aid the process of electronic document discovery by providing a set of tools to both uniquely identify and categorize documents. While the overall capability of search engines is constrained by the distinguishing characteristics of electronic data, a wide variety of data collection, data organization and data query methods are employed to assist users in identifying documents of interest.

7.1.2. Fielded Search (Searching Metadata)

Fielded searches are based on values stored as metadata rather than the actual content of an electronic asset. Searches can be refined using metadata information extracted during processing, such as sender or receiver, creation date, modified date, author, file type and title, as well as subjective user-defined values that may be ascribed to a document as part of downstream review. Examples of subjective field values could include designations for relevance, confidentiality, and custodial ownership. Coded values can be used both to add distinguishing information to a document and to code document features that can be used to categorize a set of documents. In this way, field values can be used individually or in combination to create highly specific results sets.

The effectiveness of a fielded search is predicated on the degree to which the query has been refined. Most fielded searches can be refined by using operators to join combinations of fields, as well as operators that expand or limit queried field values. For example, Boolean operators can be used to influence the inclusivity or exclusivity of a search query. A query using an OR operator between a specific file type and a specific author (e.g. Microsoft Excel OR Author A) will generate a far more inclusive set of results than a query for (Microsoft Excel AND Author A). The use of ‘OR’ would, however, be highly effective when searching for a single file type from a number of authors – (Author A OR Author B OR Author C) AND Microsoft Excel.

Whereas Boolean operators can be used to join fields, other operators such as EQUALS, BETWEEN, CONTAINS and LIKE are used to refine field value searching. For example, a search using EQUALS will return only exact matches for a specific field value – ‘Title’ EQUALS ‘Second Quarter Sales Report’. A reviewer may use the operators LIKE or CONTAINS to construct a more inclusive query for a broader set of field values. The operator LIKE runs what is termed a ‘fuzzy search’, which returns approximate matches of the queried field value – Title LIKE ‘2nd Quarter Sales Report’. ‘LIKE’ is useful because it will still return matches despite spelling errors or other variations in the naming scheme found in a document. ‘CONTAINS’ will return any file where the specified keyword appears as part of the field value. The search ‘Title’ CONTAINS ‘Report’ will return ‘Second Quarter Sales Report’ in addition to any other document with the word ‘Report’ in the title. This can be a very useful approach when the specific name of a document is unknown.
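The behavior of these operators can be sketched in a few lines of Python. The document set and field values below are hypothetical, and the LIKE operator is approximated here with a similarity ratio from the standard difflib module; commercial engines implement fuzzy matching in their own, often proprietary, ways.

import difflib

documents = [
    {"title": "Second Quarter Sales Report", "author": "Author A"},
    {"title": "2nd Quarter Sales Report", "author": "Author B"},
    {"title": "Annual Budget", "author": "Author A"},
]

def field_equals(docs, field, value):
    # EQUALS: exact match on the stored field value
    return [d for d in docs if d[field] == value]

def field_contains(docs, field, value):
    # CONTAINS: the keyword appears anywhere within the field value
    return [d for d in docs if value in d[field]]

def field_like(docs, field, value, threshold=0.7):
    # LIKE: approximate ("fuzzy") match tolerating spelling variations
    return [d for d in docs if difflib.SequenceMatcher(
        None, value.lower(), d[field].lower()).ratio() >= threshold]

print(field_equals(documents, "title", "Second Quarter Sales Report"))  # one hit
print(field_contains(documents, "title", "Report"))                     # two hits
print(field_like(documents, "title", "2nd Quarter Sales Report"))       # both report titles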

7.1.3. SQL and Other Query Languages

7.1.3.1. What is SQL?

SQL is an acronym for Structured Query Language. In the scope of e-discovery, SQL provides a syntactical framework that is used to input, manage and query the information stored in a relational database management system (RDBMS), typically referred to as ‘structured content’ as opposed to unstructured or semi-structured content. To gain an understanding of what SQL is and how it is used by some document management systems, it is helpful to first have an insight into the nature of the relational databases it operates upon.

Relational databases provide a logical structure that both stores varying types of information (i.e. data) and represents the relationships between distinct pieces of information. In SQL, data is classified according to three levels: Class → Category → Data Type. Some examples of classes include data representing numeric values, ASCII or Unicode characters, or chronological values. Categories under each of these classes separate data according to operational features. Some data features include varying or static lengths for strings of characters, decimal or integer values for numeric data and date or time expressions for chronological data. Finally, data types provide a finer level of granularity that specifies the exact format in which data is stored in the database.

Relationships in a relational database are represented by linkages that exist between two or more pieces of data. For example, in the table below each row represents the distinct features associated with a single person and each column represents a feature that is coded for each individual. In this way, data that is unique to an individual such as age, name and birth date will always be linked in the database. In turn, each entity – in this case a person – that is entered into this table is linked to a set of characteristics specified by the structure of the database.

ID  First Name  Last Name  Age  Birth Date  Blood Type ID
1   Jim         Smith      5    2003-12-01  2
2   Cindy       Walker     21   1987-12-28  2
3   Mark        Bush       72   1936-08-19  8
4   Peter       Hamden     34   1974-03-26  5
5   Rebecca     Larson     31   1977-04-27  6

The first defining characteristic of SQL is illustrated by how the language is used to create a relational structure among data types and enter specific values for each entity represented in the table. For example, the following statement is used by a SQL based RDBMS to create the table illustrated above:

/* Create a table to hold various patient data */

CREATE TABLE [dbo].[Patient_Data](
    [ID] [bigint] NULL,
    [First Name] [varchar](50) NULL,
    [Last Name] [varchar](50) NULL,
    [Age] [smallint] NULL,
    [Birth Date] [datetime] NULL,
    [Blood Type ID] [bigint] NULL
)

Once the structure is created, SQL provides syntax for populating tables with data:

/* Insert a record into the Patient_Data table */

INSERT INTO dbo.Patient_Data VALUES (1, 'Jim', 'Smith', 5, '2003-12-01', 2)

The second defining feature of SQL is its ability to manipulate data within a database. For example, the following statement can be used to increase the age by one year for all the people listed in the table.

/* Increase all ages by one year */

UPDATE Patient_Data SET Age = Age + 1

The final defining feature of SQL is its ability to return data from one data field based on its relationship with another data field. For example, the following query will return the first and last name of individuals listed in the table above based on their age.

/* Retrieve First Name and Last Name of all individuals aged five years or older */

SELECT [First Name], [Last Name] FROM Patient_Data WHERE Age >= 5

In the field of e-discovery, RDBMSs are used because of their ability to store large amounts of data in a compact format without losing any information. For example, in the table above, blood type is represented by a single ID number. The following SQL query can be used in combination with the table below to retrieve each person’s blood type.

Blood Type ID  Blood Type
1              O Negative
2              O Positive
3              A Negative
4              A Positive
5              B Negative
6              B Positive
7              AB Negative
8              AB Positive

/* Retrieve First Name, Last Name and Blood Type for all individuals in the database.
   Note: A and B are table aliases, used to save retyping table names. */

SELECT [First Name], [Last Name], [Blood Type]
FROM Patient_Data A
INNER JOIN Blood_Types B ON A.[Blood Type ID] = B.[Blood Type ID]

In this example, the single numeric value “8” can represent the 11-character string “AB Positive”. While the savings in data storage may not be apparent in this example, they become clear when considering a table listing information for millions of people. Specifically, if an integer takes up only 4 bytes of storage space vs. 20 bytes for a character string, the potential storage savings over 1 million entries, for a single column in the table, is 16 megabytes. Savings in storage space also often translate to faster completion of search queries. Furthermore, minimizing redundancy has the added benefit of simplifying data management. Continuing with the example, a typographical error made while entering blood types would only need to be fixed in one location when the data is split across two tables. This is clearly less time-consuming than attempting to fix 1 million patient records where the blood type is stored along with personal information.

To perform searches on large databases with adequate performance, it is important to create the proper indexes, including on columns where JOINs are typically performed. In addition to the relational data, databases support indexing based on data type and are often efficient at sorting information that is indexed, e.g. a date column. Some databases also have full-text indexing capabilities built in.
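As a minimal, self-contained illustration, the normalized design above can be reproduced with Python’s built-in sqlite3 module. The table and column names follow the hypothetical patient example, and the CREATE INDEX statement shows the indexing of a join column discussed above.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patient_Data (
    ID INTEGER, First_Name TEXT, Last_Name TEXT,
    Age INTEGER, Birth_Date TEXT, Blood_Type_ID INTEGER);
CREATE TABLE Blood_Types (Blood_Type_ID INTEGER, Blood_Type TEXT);

-- Index the join column so large joins do not require full table scans
CREATE INDEX idx_patient_blood ON Patient_Data (Blood_Type_ID);

INSERT INTO Patient_Data VALUES (1, 'Jim', 'Smith', 5, '2003-12-01', 2);
INSERT INTO Blood_Types VALUES (2, 'O Positive');
""")

# Resolve each patient's numeric blood type ID to its full name
for row in conn.execute("""
    SELECT p.First_Name, p.Last_Name, b.Blood_Type
    FROM Patient_Data p
    INNER JOIN Blood_Types b ON p.Blood_Type_ID = b.Blood_Type_ID
    WHERE p.Age >= 5"""):
    print(row)  # ('Jim', 'Smith', 'O Positive')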

The true power of SQL becomes evident when important facts and trends are hidden among massive amounts of data. The combination of SQL and a powerful RDBMS platform allows data to be represented, manipulated, classified, and summarized in a standard and robust manner. SQL can also be used to ingest different datasets and determine their overlap or disjointedness. SQL provides a rich and diverse set of tools for handling many data-related tasks.
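A short sketch of such a dataset comparison, again using Python’s sqlite3 module with two hypothetical term lists: INTERSECT returns the overlap between the datasets, while EXCEPT returns the portion unique to the first.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Set_A (term TEXT);
CREATE TABLE Set_B (term TEXT);
INSERT INTO Set_A VALUES ('alpha'), ('beta');
INSERT INTO Set_B VALUES ('beta'), ('gamma');
""")

# INTERSECT: terms in both datasets; EXCEPT: terms only in Set_A
print(conn.execute("SELECT term FROM Set_A INTERSECT SELECT term FROM Set_B").fetchall())  # [('beta',)]
print(conn.execute("SELECT term FROM Set_A EXCEPT SELECT term FROM Set_B").fetchall())     # [('alpha',)]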

7.1.4. Indexing

Indexing is a process that inventories the total content of a file. The end result is similar to the index at the back of a textbook. Without an index, the process of searching for a specific word or phrase would involve a page-by-page review of the text for each new query. An index allows a reader to quickly and efficiently locate pages containing a specific term or phrase. Search indexes serve precisely the same function: they are tools designed to facilitate and expedite the retrieval of information.

As data is indexed, the content is scanned to identify unique terms and their locations within the text, and to perform additional functions against the data, such as stemming, ranking, natural language processing or conceptual modeling. This information is then retained in a database or structured search catalogue that is specific to the search engine used, and can be queried using a standard structured query language. Querying an index can involve a simple search for a single keyword, or a more complex query involving multiple keywords and proximity restrictions. The syntax and format for queries depend on the conventions used by a search engine or the database language used to query the indexed data.

Search engines will use both common and proprietary technology to build indexes and service search queries.

For the most part, many search engines treat it as impractical to index every term in a file. Certain words are so commonplace as to offer little or no value when conducting a content search. To avoid creating an overly inclusive index, most indices utilize a noise word filter: a customizable list of terms that are ignored during indexing. Some common noise words include ‘a’, ‘and’, ‘the’, ‘from’, and ‘because’.
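The mechanics can be illustrated with a toy inverted index in Python. The three-document collection and noise word list below are hypothetical; production engines add stemming, ranking, and positional data on top of this basic structure.

from collections import defaultdict

NOISE_WORDS = {"a", "and", "the", "from", "because"}

documents = {
    1: "the second quarter sales report",
    2: "sales from the new region",
    3: "a report because of the audit",
}

# Build the inverted index: each term maps to the set of documents containing it
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        if term not in NOISE_WORDS:  # noise words are never indexed
            index[term].add(doc_id)

print(sorted(index["report"]))  # [1, 3]
print(sorted(index["sales"]))   # [1, 2]
print("the" in index)           # False: the noise word cannot be searched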

7.1.4.1. Term Selection

Term selection determines which terms in a document are indexed and findable through the search engine. Some search engines use complete term indexing, covering all terms, while others use partial term indexing, with differing techniques to eliminate some terms before indexing. Terms that are eliminated before indexing cannot be searched on, so care must be taken when choosing which terms to eliminate. Complete and partial term indexes have different advantages and disadvantages, which are described below.

Organizations procuring search technologies should ask what type of term selection is used. If noise words (black lists) or pre-selected words (white lists) are used, the lists should be available to the users and may need to be delivered to the opposing party for increased transparency.

Complete Term Indexing (index of all terms)

All Terms Can Be Searched

Complete term indexes cover all source terms and are the most comprehensive method of term selection. They ensure that every term in the files can be successfully searched, eliminating the problem of false negatives and additional noise which can occur when words are removed from the indexing process.

Accurate Search

Indexing all terms provides the most accurate search and reduces both false negatives and false positives that can exist with other approaches.

False negatives are reduced or eliminated by indexing all terms because searching on any term will find the documents that contain it; no terms have been removed that would prevent an accurate search.

False positives are reduced because term elimination also removes terms from search queries. For example, searching on the phrase “vitamin a” where the word “a” was eliminated from the index and query would return all documents with the word “vitamin” no matter what vitamin was being discussed.

Partial Term Indexing Using Noise or Stop Words (black list)

Noise or Stop Words

Noise words, also known as stop words, are typically common words that some search engines use as a black list for term removal when creating the search index. The reasoning behind this is that some search engines consider some terms to have little or no value and choose not to index them, hence the name noise words. Some common noise words include ‘a’, ‘and’, ‘the’, ‘from’, and ‘because.’ Noise words vary by language so search engines using noise words must correctly identify the language being indexed and select the appropriate black list.

Products that use noise words often make the black lists available to their users so one can be informed of what words are not indexed and cannot be searched.

Reduced Information (False Negatives)

A significant issue with noise words arises when one wants to find one or more words that have been removed from indexing: the desired search simply cannot be performed.

A well-known example of this is the phrase “to be or not to be.” By itself, each word is often listed in a noise or stop word list but, together, they are obviously meaningful. Traditional use of noise words could render this phrase unfindable using the search engine.

Other examples where meaningful information may be eliminated include “vitamin a,” “stock symbol k,” “C++,” etc.

Increased Noise (False Positives)

When noise words are used, they are eliminated from not only the search index but also the search query. Automatically eliminating words from the query can return documents that one was not expecting to receive, producing false positive results. For example, searching for “vitamin a” may have the “a” removed, returning all documents with the word “vitamin.”

Partial Term Indexing Using Pre-Approved Words (white list)

Pre-Approved Words

This approach indexes only the words on a pre-determined list, which functions as a term white list. When white lists are used, words that are not on the list are typically not indexed and cannot be searched on.

While this type of indexing typically provides the least search capability, it is used by some popular products, so it is important to understand the characteristics of this approach. It generally produces the smallest index; however, it achieves that by eliminating from the index all words that are not on its list.

Reduced Information (False Negatives)

Partial term indexes using pre-approved words can dramatically reduce the amount of information indexed, much more than the use of noise words. While noise words remove information that is often, but not always, of little use, pre-approved white lists may miss many important words.

For example, if a search index was created without pre-approving the name ‘Katrina’ or the word ‘Hurricane’, searches for ‘Katrina’ and ‘Hurricane Katrina’ would not return any results. These words would rarely be removed by a noise word list, but it is likely they would be eliminated by a pre-approved white list.

Increased Noise (False Positives)

Pre-approved white lists can result in more false positives, and more documents for review, than either complete term indexes or noise word approaches.

7.1.4.2. Additional Indexing Customization

Most indices can be customized to better meet specific searching needs. Custom setup preferences can include the sensitivity of upper- and lower-case letters, recognition of specific date formats, the inclusion or exclusion of specific file types, and Unicode compliance to identify and index foreign characters. Each of these customized parameters will influence the manner in which data can be searched, but may also allow for more comprehensive indexing.

Similar to noise word filtering, nearly all punctuation in a document is ignored during indexing. Characters such as periods, quotations, ampersands, tildes, commas and brackets are all indexed as empty spaces. Depending on the search term syntax used by a search engine, specific punctuation may be recognized as a search operator. The inclusion of punctuation as a search operator, however, does not impact the way in which a document is indexed. Derivations of a root term can also be queried by adding other characters such as asterisks, exclamation points and percentage signs. For example, a search for legal* would generate hits for legal, as well as legality, legalities, legalize, etc. Another example would be the addition of a tilde to the end of a word to search for all synonyms of that word.
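The trailing-asterisk example can be emulated by translating the wildcard into a regular expression and matching it against the indexed terms; the term list below is hypothetical.

import re

index_terms = ["legal", "legality", "legalities", "legalize", "legacy"]

def wildcard_matches(pattern, terms):
    # Translate a trailing '*' into the regular expression '.*'
    regex = re.compile(re.escape(pattern).replace(r"\*", ".*") + "$")
    return [t for t in terms if regex.match(t)]

print(wildcard_matches("legal*", index_terms))
# ['legal', 'legality', 'legalities', 'legalize'] -- 'legacy' does not match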

7.1.5. Streaming Text

The streaming text method of searching does not pre-index the content or metadata, but instead reads the text of a document one keyword at a time, from start to finish. Each term is examined and matched against the query in real time; if there is a match, the document is considered a search hit.

This technique closely mirrors how a human would look for keywords in a printed document. Since automated search using this basic technique is limiting, certain variations on the scheme are available, including:

  1. Place a wildcard character in the keyword of the search query in order to improve the search results by including wildcard matches.
  2. Provide regular expressions matching to expand the search.
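A minimal sketch of a streaming keyword scan in Python, including the regular expression variation; the file name and query below are hypothetical, and real implementations must first extract plain text from formats such as Word or Excel.

import re

def stream_search(path, pattern):
    # Nothing is pre-indexed: the full text is scanned on every query
    regex = re.compile(pattern, re.IGNORECASE)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if regex.search(line):
                return True  # the document is a search hit
    return False

# The regular expression variation broadens the match, e.g. 'report' or 'reports'
print(stream_search("memo.txt", r"reports?"))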

However, this technique should be used with care. Typical challenges with this technique are:

  1. The entire content needs to be completely scanned for each search, causing the search to take a very long time.
  2. The process (full scan) is repeated for each consecutive search.
  3. This has the potential to alter the metadata of ESI.
  4. Many ESI types are not searchable, since the data is stored in a form that cannot be streamed and searched. As an example, Word documents and Excel spreadsheets are not easily streamed for keyword searching.
  5. Expressing complex searches or results is difficult.

7.1.6. Data Type Specifications

Electronic information will nearly always exist as a combination of structured data elements and unstructured content. “Structured” content is any overtly labeled element, metadata, or field, such as the sender of an email, the creation date, or a tag applied to a document during review. Databases are designed to manage, organize and automate processes based on these structured content values. Unstructured content, such as the body of an email or an audio file, is typically where the majority of relevant information resides; because it is not tagged or parsed, search technologies are used to query the content, or an index of the information it contains, and present the results.

Servicing e-discovery will commonly combine fielded (structured) search with the ability to query the unstructured content. Unlike keyword searches on unstructured content, where a keyword is a simple string, fielded constraints may require specifying a particular data type. For example, to search on metadata such as the Last-Modified-Time of a document, the search criteria must specify values in the form of a DATE.

Table of Data Types:

Data Type Description
INTEGER A simple whole number.
FLOAT A fractional number, using the IEEE 754 floating point definition.
DATE A date value, using the ISO 8601 date format.
STRING A string parameter, as in keyword searches, but matched against a metadata field.
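A sketch of applying a DATE-typed constraint using Python’s standard datetime module; the document list, field names, and cutoff value below are hypothetical. Because the values are parsed as dates, they are compared chronologically rather than as raw strings.

from datetime import date

documents = [
    {"name": "budget.xlsx", "last_modified": "2011-03-15"},
    {"name": "memo.docx", "last_modified": "2009-07-01"},
]

# Parse the ISO 8601 strings so the comparison is chronological
cutoff = date.fromisoformat("2010-01-01")
hits = [d["name"] for d in documents
        if date.fromisoformat(d["last_modified"]) >= cutoff]
print(hits)  # ['budget.xlsx']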

7.2. Language and Content

As companies increasingly become multi-national, litigation involving non-English electronic documents and e-mail becomes more and more common as well. Computers are able to store the different alphabets and character sets of all existing languages, but using these non-English languages involves technical considerations that can affect how a search is performed and what results it returns.

7.2.1. Character Encoding Specification

All electronic data is represented as sequences of bits, or numbers. Therefore, each character in the alphabets and scripts used in written languages is mapped to a unique numeric value, or ‘encoded’, for use on a computer using a standard known as Unicode. Each letter or character has been assigned its own unique value, and these values are stored using the Unicode encoding schemes, known as Unicode Transformation Formats (UTF), of which the most commonly used are UTF-8 and UTF-16. For example, the English alphabet and the more common punctuation marks have been assigned values between 0 and 255, while Tibetan characters have been assigned values between 3,840 (written as U+0F00) and 4,095 (written as U+0FFF). All modern (and many historical) scripts are supported by the Unicode Standard. Unicode provides a unique number for every character, regardless of the platform, program, or language. The Unicode Standard is described in detail at http://www.unicode.org.

When deciding to store and search non-English documents, the following points need to be considered:

  1. The search system needs to be able to support Unicode, since some systems were created to support text encoding schemes which predated Unicode.
  2. Some of the more common non-English languages, particularly Asian languages such as Chinese and Japanese, require two bytes instead of one byte in order to store a single character. A multi-byte encoded document could require twice the storage space of a single-byte encoded document with the same number of characters. This is an important consideration when allocating storage space for multi-byte encoded documents.
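Both points can be verified directly in Python, which exposes the byte counts produced by each encoding scheme:

english = "hello"
japanese = "こんにちは"  # also five characters

# UTF-8 stores ASCII in one byte per character, Japanese in three
print(len(english.encode("utf-8")))    # 5
print(len(japanese.encode("utf-8")))   # 15

# UTF-16 stores both in two bytes per character (plus a 2-byte byte-order mark)
print(len(english.encode("utf-16")))   # 12
print(len(japanese.encode("utf-16")))  # 12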

7.3. Specifying Case

In general, keyword searches match words in documents without considering whether any or all of the letters in the keyword or the documents are uppercase or lowercase. If case distinctions are important for a search, the search specification must include them.

Specifying that the search must be case sensitive will match the exact case for all letters in the keyword and in the documents. For example, a case-sensitive search on AIDS will match the word AIDS in the phrase “increased number of cases of AIDS” but won’t match the word aids in the phrase “the nurse aids the operating room surgeon”. Similarly, a case-sensitive search on Rose will match the name “Rose Jones” but won’t match the phrase “rose garden”.
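In code, the difference amounts to whether the comparison normalizes case; a quick sketch using the AIDS example above:

phrases = [
    "increased number of cases of AIDS",
    "the nurse aids the operating room surgeon",
]
keyword = "AIDS"

# Case-sensitive: only the exact-case word matches
print([p for p in phrases if keyword in p.split()])

# Case-insensitive: both phrases match once case is normalized
print([p for p in phrases if keyword.lower() in p.lower().split()])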

7.3.1. Language Specification

A collection of multi-language documents may contain not only documents written in several different languages, but also single documents that are themselves written in two or more languages. In either case it becomes necessary to specify which language the search terms belong to.

For example, if the search term entered is pan, this can mean “bread” in Spanish as well as “pan” in English. Similarly, son can mean “its” in French and “son” in English. Specifying which language the search term is intended to belong to will affect the search results. In a similar vein, the differences between British English and American English can affect the result set if the wrong term is chosen, such as using the American term “trunk” instead of the British term “boot”.

In addition, search engines may have noise words for each supported language. Just as some search engines eliminate high-frequency English words such as a, and, and the, a search term that is meaningful in one language may appear as a noise word in another. Again, specifying the language of the search terms will affect the search results, and it is important to obtain the noise word lists, by language, used by the search engine.

7.3.2. Tokenization Specification

Before a search can occur, a search engine needs to take the text in a document and break the text into searchable keywords. This process is known as tokenization. Tokenization involves identifying special characters such as blank spaces, commas, or periods and using them as separators between words. As an example, if a document contains “cats or dogs”, the tokenization looks for spaces and creates three keywords: “cats”, “or” and “dogs”.
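A sketch of this simple tokenization using a regular expression; treating the hyphen as part of a token is just one of several possible design choices (see the discussion of hyphens below).

import re

text = "cats, dogs or e-mail"

# Treat any run of letters, digits, or hyphens as a token, so that
# punctuation separates words while 'e-mail' survives as a single token
tokens = re.findall(r"[A-Za-z0-9-]+", text)
print(tokens)  # ['cats', 'dogs', 'or', 'e-mail']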

7.3.2.1. Special Cases with Tokenization

1. The Hyphen (-) Character

The following list contains legitimate words:


e-mail
editor-in-chief
well-respected

Each of these is considered a word, but each also contains distinct words within it. Search engines may tokenize each of these as one word, multiple words, or both, depending on the algorithm the search engine uses.

2. Chinese, Japanese, and Korean (collectively CJK) Languages

While most languages use blank spaces or other special characters to indicate breaks between words, a few do not. CJK languages, for example, are written with all characters side by side, and words can be made up of one, two, or more characters. In addition, words can be made up of other words. Context determines how to break the characters into words.

There are two major methods to tokenize CJK languages, N-gram and dictionary-based approaches, each with their own characteristics.

The N-gram method breaks sentences into individual characters regardless of word boundaries, and matches each character in the search string with the characters in the document. This allows the search engine to index and find all words in these languages with minimal overhead. In this approach, the Japanese word for “engine”, エンジン, would be indexed as the overlapping pairs エン, ンジ, ジン. By indexing all characters without the need for a priori knowledge of the language and its words, the search engine can guarantee that all terms in the file can be searched on. A consequence of guaranteeing that all words can be found is that the approach can produce a larger search result set than dictionary-based approaches, which may not be able to find all words.
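Character bigram indexing can be sketched as follows, using the word from the example above:

def bigrams(text):
    # Break the string into overlapping two-character units
    return [text[i:i + 2] for i in range(len(text) - 1)]

word = "エンジン"  # 'engine', four characters
print(bigrams(word))  # ['エン', 'ンジ', 'ジン']

# A query is bigrammed the same way and matched against the indexed bigrams
query = "エンジン"
print(set(bigrams(query)) <= set(bigrams(word)))  # True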

The dictionary-based approach uses lists of known words to tokenize CJK languages into proper words. The use of dictionaries can work well to reduce search results for common words when the risk of not finding a document (false negatives) is low. However, dictionaries are only as effective as their word coverage. Dictionaries are maintained by a few organizations and often run to millions of entries; even so, this is not enough to cover all words, as new general words and proper nouns are constantly being created in these languages. Additionally, dictionaries need to cover the many uncommon word variants that exist in CJK languages, for example, the word for island, which can be represented in Japanese as 嶋 or 嶌 instead of the usual 島. Although it is impossible to have a fully up-to-date dictionary, dictionary maintainers periodically release updated versions. When a new release of a dictionary is available, the corpus must be re-indexed to find the newly added words.

In general, the N-gram approach can be relied upon to produce complete search results at the cost of a larger result set, while dictionary-based approaches can reduce the result set so that fewer documents need to be reviewed, at the price of increasing the risk that some files may not be findable. The advantages and limitations of each approach should be understood by the producing and receiving parties.

3. German

Some languages, such as German, make heavy use of compound words that are not hyphenated to indicate the word breaks. In order to tokenize these languages, the search engine must decompound the long words, breaking them into their components. For example, the compound Aktiengesellschaft (‘stock corporation’) would be broken into Aktien and Gesellschaft.

4. Diacritical Marks

Some search engines conflate similar characters that differ only by diacritical marks into one letter, which can expand search results. For example, treating ‘é’ as ‘e’ means a search for ‘resume’ may also return documents containing ‘résumé’.

7.4. Document Segments and Segment Scope Specification

Documents often have a certain structure to them, and it may be important to consider document segments and restrict the search specification to a scope. There are two main considerations here:

  1. Scope is for a particular keyword or phrase
  2. Scope is for the entire Boolean expression

Segment-specific scope is indicated using a prefix in front of the keyword, phrase, or expression, as per the following syntax:

Syntax Description
segment: keyword The keyword must match within the specified segment.
segment: phrase The phrase must match within the specified segment.
(segment: (Boolean-expression)) The entire Boolean expression must match within the segment for the document to be selected.

E.g., to apply multiple Boolean expressions across the two segments Title and Abstract, a search string of the following form can be used: (Title: (report AND pets)) AND (Abstract: (pet-owners AND (cats AND dogs)))
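A sketch of how such a segment-scoped expression might be evaluated; the segment names and contents below are hypothetical, and each sub-expression is tested only against its named segment.

document = {
    "Title": "report and pets survey",
    "Abstract": "pet-owners prefer cats and dogs",
}

def segment_match(segment, *keywords):
    # True only if every keyword appears within the named segment
    return all(k in document[segment] for k in keywords)

# (Title: (report AND pets)) AND (Abstract: (pet-owners AND (cats AND dogs)))
hit = (segment_match("Title", "report", "pets")
       and segment_match("Abstract", "pet-owners", "cats", "dogs"))
print(hit)  # True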