EDRM Processing Glossary

Term Definition
ASCII American Standard Code for Information Interchange (ASCII) is a plain text character encoding standard in which seven- or eight-bit integers correspond to 128 or 256 characters and codes for electronic storage and communication. The eight-bit pairings are often mistakenly referred to as Extended ASCII. The 128 characters in 7-bit ASCII encoding correspond to 95 printable characters (a-z, A-Z, 0-9 and punctuation) and 33 non-printable control codes, e.g., carriage return, line feed, tab and bell. The additional 128 characters enabled by 8-bit integers are used for various purposes, e.g., foreign language characters and line drawing symbols.
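The 95/33 split in 7-bit ASCII can be checked directly. The following is a minimal Python sketch; the helper name `classify_ascii` is illustrative and not part of any standard.

```python
# Count printable characters and control codes in 7-bit ASCII (codes 0-127).
def classify_ascii():
    printable = [c for c in range(128) if 32 <= c <= 126]    # space through tilde
    control = [c for c in range(128) if c < 32 or c == 127]  # codes 0-31 plus DEL
    return printable, control

printable, control = classify_ascii()
print(len(printable), len(control))   # 95 33
print(ord("A"), chr(13) == "\r")      # 65 True (13 is carriage return)
```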
Bates Numbers Sequential numeric identifiers imprinted on document pages or assigned to files during the discovery process. Bates Numbers typically include a prefix to identify the producing party or matter as well as a numeric value (e.g., DEF_000000001).
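As a rough illustration, sequential Bates numbers in the prefix-plus-number style shown above could be generated as follows. The function name `bates_numbers` and the nine-digit default width are assumptions taken from the DEF_000000001 example, not a prescribed format.

```python
def bates_numbers(prefix, start, count, width=9):
    """Yield sequential Bates numbers such as DEF_000000001."""
    for n in range(start, start + count):
        # Zero-pad the numeric part to a fixed width so values sort correctly.
        yield f"{prefix}_{n:0{width}d}"

print(list(bates_numbers("DEF", 1, 3)))
# ['DEF_000000001', 'DEF_000000002', 'DEF_000000003']
```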
Binary File Signature Also known as “file header signature,” “binary header signature” or “magic number.” Typically the first few bytes of data in a file identify the format of the data contained therein. For example, ZIP compressed files begin with Hex 50 4B (or the initials PK in ASCII). Most JPG image files begin with Hex FF D8 FF E0.
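A signature check can be sketched in a few lines of Python. The lookup table below is hypothetical and holds only the two signatures named in this entry; real processing tools consult far larger signature databases.

```python
# Hypothetical lookup table limited to the two signatures cited in the glossary.
SIGNATURES = {
    b"\x50\x4b": "ZIP",            # Hex 50 4B, i.e. "PK" in ASCII
    b"\xff\xd8\xff\xe0": "JPG",    # Hex FF D8 FF E0
}

def identify_by_signature(data: bytes) -> str:
    """Return a format label if the leading bytes match a known signature."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"

print(identify_by_signature(b"PK\x03\x04rest of archive"))     # ZIP
print(identify_by_signature(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # JPG
print(identify_by_signature(b"plain text"))                    # unknown
```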
Case Normalization Improves search recall by adding information to an index so that a search for a term in lowercase characters also identifies its uppercase counterparts and vice versa. For example, a search for Rice will also find instances of RICE and rice.
Character Normalization Seeks to minimize the impact of variations in alphanumeric characters often overlooked by human beings but posing a challenge to machines. This may include Case Normalization, Diacritical Normalization and Unicode Normalization.
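One simple way to picture case normalization is to fold both the query and the indexed terms to a common case before comparing them. This Python sketch uses the built-in `str.casefold`; the sample term list is invented for illustration, and real indexes may handle case differently.

```python
def normalize_case(term: str) -> str:
    # casefold() is a more aggressive form of lower() intended for caseless matching.
    return term.casefold()

documents = ["Rice", "RICE", "rice", "price"]
query = "Rice"
hits = [d for d in documents if normalize_case(d) == normalize_case(query)]
print(hits)  # ['Rice', 'RICE', 'rice']
```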
Chain of Custody The procedures employed to protect and document the acquisition, handling and storage of evidence to demonstrate these activities did not alter or corrupt evidentiary integrity.
Compression The storage or transmission of data in a reduced size by using technology to eliminate redundancy (“lossless compression”) or by removing non-essential details, such as picture elements in a JPEG or inaudible components of audio (“lossy compression”). Compression permits more efficient storage, sometimes at the cost of reduced fidelity. ZIP and RAR are common lossless compression formats in eDiscovery; TAR, often seen alongside them, is an archive format that bundles files without compressing them unless paired with a compressor such as gzip.
Container File A file that holds or transports other files, e.g., compressed container files (.ZIP and .RAR) and email container files (.PST and .MBOX). Container file content is “unpacked” or “exploded” during processing, enabling the container file to be suppressed as immaterial once fully extracted.
Corruption Damage to the integrity of a file that impacts its ability to be processed. File corruption may be caused by network transmission errors, software glitches, physical damage to storage media (e.g., bad sectors) or use of an incompatible decoding tool.
Custodian The individuals or entities who hold, or have the right to control, records and information.
DAT File A delimited load file used in conjunction with Concordance-formatted productions. A .DAT file includes a header row of field identifiers that corresponds to the data that follows. Each field is separated (“delimited”) by a character (“delimiter”) that signals the division of fields.
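A minimal Python sketch of reading such a file follows. It assumes the delimiters commonly seen in Concordance-style load files (ASCII 20 between fields and the thorn character, þ, around each value); the entry above does not specify them, and actual productions may use others. The field names BEGDOC and CUSTODIAN are hypothetical.

```python
FIELD_SEP = "\x14"   # assumed field delimiter (ASCII 20)
QUOTE = "\u00fe"     # assumed text qualifier (thorn)

def parse_dat(text):
    """Parse delimited text into a list of dicts keyed by the header row."""
    rows = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if line:
            rows.append([field.strip(QUOTE) for field in line.split(FIELD_SEP)])
    header, *records = rows
    return [dict(zip(header, record)) for record in records]

def make_line(fields):
    # Helper used only to build the sample input below.
    return FIELD_SEP.join(f"{QUOTE}{f}{QUOTE}" for f in fields)

sample = "\n".join([
    make_line(["BEGDOC", "CUSTODIAN"]),
    make_line(["DEF_000000001", "Smith"]),
]) + "\n"

records = parse_dat(sample)
print(records)  # [{'BEGDOC': 'DEF_000000001', 'CUSTODIAN': 'Smith'}]
```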
Deduplication The identification and suppression of identical copies of messages or documents in a data set based upon the items’ hash values or other criteria.
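Hash-based deduplication can be sketched as below. This is an illustrative Python example only: it hashes raw content with SHA-256 and keeps the first copy seen, whereas production tools often hash selected fields (particularly for email) and apply custodian-level or global rules.

```python
import hashlib

def deduplicate(items):
    """Keep the first item seen for each distinct hash; suppress later copies."""
    seen = set()
    unique, suppressed = [], []
    for name, content in items:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            suppressed.append(name)
        else:
            seen.add(digest)
            unique.append(name)
    return unique, suppressed

items = [("a.txt", b"quarterly report"),
         ("copy_of_a.txt", b"quarterly report"),
         ("b.txt", b"meeting notes")]
unique, suppressed = deduplicate(items)
print(unique)      # ['a.txt', 'b.txt']
print(suppressed)  # ['copy_of_a.txt']
```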
DeNIST The use of hash values to identify, suppress and/or remove commercial software from a data collection. The hash values are maintained by the National Institute of Standards and Technology (NIST) in its National Software Reference Library (NSRL).
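Conceptually, deNISTing is a set-membership test against a list of known hashes. The Python sketch below uses a small stand-in set built from sample data, not real NSRL values, and the choice of SHA-256 is an assumption made for illustration.

```python
import hashlib

def denist(files, known_hashes):
    """Return names of files whose hash is absent from the known-software list."""
    return [name for name, content in files
            if hashlib.sha256(content).hexdigest() not in known_hashes]

# Stand-in for a reference list; real lists are published by the NSRL.
known_hashes = {hashlib.sha256(b"operating system library").hexdigest()}

files = [("system.dll", b"operating system library"),
         ("contract.docx", b"user-created agreement")]
print(denist(files, known_hashes))  # ['contract.docx']
```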
Diacritical Normalization Improves search recall by indexing terms with diacritics (e.g., accented characters) so that they match their counterparts without diacritics. For example, a search for “résumé” would also locate instances of resume and vice versa.
dtSearch A content extraction, indexing and text search tool licensed to and at the heart of several leading e-discovery and computer forensic tools (e.g., Relativity, LAW, Ringtail (now Nuix Discover) and AccessData’s FTK).
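One common way to approximate diacritical normalization is to decompose characters and drop the combining marks. The Python sketch below uses the standard `unicodedata` module; the function name `strip_diacritics` is illustrative, and search tools may implement this step differently.

```python
import unicodedata

def strip_diacritics(text):
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("résumé"))  # resume
```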
Elasticsearch An open source content extraction, indexing and text search tool used by a number of software providers for indexing and keyword search. It is based on the Lucene open source search engine library project.
Encoding The process of converting electronically stored and transmitted information from one form to another. Character encoding maps alphanumeric characters into numeric values, typically notated as binary or hexadecimal numbers. ASCII and Unicode are examples of character encoding.
Encryption The process of encoding data to unintelligible ciphertext to prevent unauthorized access without the proper decryption key (e.g., password).
ESI Electronically Stored Information (ESI) as defined by Federal Rule of Civil Procedure 34(a)(1)(A), includes “writings, drawings, graphs, charts, photographs, sound recordings, images, and other data or data compilations—stored in any medium from which information can be obtained either directly or, if necessary, after translation by the responding party into a reasonably usable form.”
Exception Reporting The process of identifying items which fail during processing. Exceptions may include encrypted files that cannot be read, corrupt files, files in unrecognized formats or languages, and files that require optical character recognition (OCR) for text extraction.
Family Group In the context of an email, a transmitting message (parent object) and its attachments (child objects).
File Header Signature Also known as a “binary header signature,” “binary file signature” or “magic number.” Typically the first few hex bytes of data in a file identify the format of the data within the file. For example, ZIP compressed files begin with Hex 50 4B (or the initials PK in ASCII). Most JPG image files begin with Hex FF D8 FF E0.
Filtering The process of culling files from a data set based on characteristics such as file type, date and size. In e-discovery, files are filtered to suppress multiple copies of the same item (deduplication), irrelevant system files (deNISTing) and immaterial container files after content extraction, and to narrow the set by lexical search (filtering by keywords).
Forensic Image An exact, verified copy of electronic media. Forensic imaging produces a hash-authenticated, sector-by-sector (“bitstream”) copy of electronic media that can be restored for analysis. This process is typically used to preserve active data, unallocated clusters and file slack space.
Hash A “digital fingerprint” of data or “message digest,” generated by a one-way cryptographic algorithm (e.g., MD5, SHA-1, SHA-256) and recorded as a hexadecimal character string, e.g. 13bfb1528002a68d94249c4ffb09359f. The potential of two different files having matching hash values is so remote that hash value comparisons serve as effective tools for file authentication, file exclusion (DeNISTing) and data deduplication.
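Computing these digests takes only a few lines with Python's standard `hashlib` module. The helper name `digests` is illustrative; the sketch shows that each algorithm yields a fixed-length hexadecimal string and that a one-character change in the input produces a different value.

```python
import hashlib

def digests(data: bytes) -> dict:
    """Return the MD5, SHA-1 and SHA-256 hex digests of the given bytes."""
    return {name: hashlib.new(name, data).hexdigest()
            for name in ("md5", "sha1", "sha256")}

d = digests(b"hello world")
print(d["md5"])           # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(len(d["sha1"]))     # 40 hex characters (160 bits)
print(len(d["sha256"]))   # 64 hex characters (256 bits)
print(d["md5"] == digests(b"hello world!")["md5"])  # False
```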
Identification In e-discovery, the mechanism by which a processing tool determines the structure and encoding of a file based upon the file’s header signature and filename extension.
IM Instant Message (IM) is a form of real-time text communication over the Internet typically expressed in conversation form. IM can involve communications between two people or larger groups, who sometimes communicate in “rooms.”
Image Format Images initially referred to the output from document scanning but can also refer to files rendered directly from native files. These files are created to emulate a printed page. In e-discovery, the most common image formats are Tagged Image File Format (TIFF), Portable Document Format (PDF) and JPEG. “Rendering” is the processing step where ESI is converted to image formats.
Index A data structure that improves the speed of search for data retrieval. E-discovery employs full text indexing of processed data to speed search and to reduce storage space.
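A full-text index is often built as an inverted index, mapping each term to the documents that contain it, so a query is a lookup and not a scan of every file. The Python sketch below is a toy version under that assumption; the document IDs and text are invented.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "rice and beans", 2: "brown rice", 3: "black beans"}
index = build_index(docs)
print(sorted(index["rice"]))   # [1, 2]
print(sorted(index["beans"]))  # [1, 3]
```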
Ingestion The act of loading data into an application for processing.
Keyword Search term used to query an index or database.
Language Detection Recognition and identification of foreign language content that enables selection of appropriate filters for text extraction, character set selection and diacritical management. Language detection also facilitates assigning foreign language content to native speakers for review.
Load File An ancillary file used in e-discovery to transmit system and application metadata, extracted text, Bates numbers and structural information describing the production. Load files accompany folders holding native, text and image files and provide essential information about the files being transmitted.
Lucene An open source library for content extraction, indexing and text searching used by a number of software providers for indexing and keyword search. Elasticsearch and Solr are based on the Lucene library.
MD5 Message Digest 5 (MD5) is a common cryptographic hash algorithm used for file authentication, file exclusion (DeNISTing) and data deduplication.
Metadata Data describing the characteristics of other data. File metadata may be System Metadata stored outside the file (e.g., file name, size and dates last modified, accessed or created) or Application Metadata stored within the file (e.g., last printed date or amount of editing time). The term metadata can also include human judgments about a file, e.g., hot or privileged, or information about the file, e.g., from, to, subject, sentdate.
MIME Multipurpose Internet Mail Extensions (MIME) refers to a two-part, hierarchical method of classification for electronic files. MIME Types (also known as Media Types) classify files within one of ten types: application, audio, image, message, multipart, text, video, font, example and model. Each type is divided into subtypes with sufficient granularity to describe all common variants within the type. For example, the MIME Type of a PDF file is “application/pdf,” a .DOCX file is “application/vnd.openxmlformats-officedocument.wordprocessingml.document,” and a TIFF image file is “image/tiff.” The Internet Assigned Numbers Authority (IANA) is a standards organization that registers new types and subtypes in the MIME Type taxonomy.
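The two-part type/subtype structure can be illustrated with a short Python sketch that splits a MIME Type string and checks the top-level type against the ten listed above. The function name `split_mime` is hypothetical.

```python
# The ten top-level types listed in the glossary entry.
TOP_LEVEL_TYPES = {"application", "audio", "image", "message", "multipart",
                   "text", "video", "font", "example", "model"}

def split_mime(mime_type):
    """Split 'type/subtype' and verify the top-level type is a known one."""
    top, _, subtype = mime_type.partition("/")
    if top not in TOP_LEVEL_TYPES or not subtype:
        raise ValueError(f"not a recognized MIME type: {mime_type!r}")
    return top, subtype

print(split_mime("application/pdf"))  # ('application', 'pdf')
print(split_mime("image/tiff"))       # ('image', 'tiff')
```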
Media Type Alternate term for MIME Type, see MIME.
Native Format In the context of software applications, native format refers to the file format which an application creates and uses by design—generally the default, unprocessed format of a file when collected from the original source, e.g., Microsoft Word stores documents as .DOCX files, their native format.
NSRL The National Software Reference Library (NSRL) is maintained by the National Institute of Standards and Technology (NIST), an agency of the U.S. Department of Commerce. The data published by the NSRL (principally hash values of commercial software) is used to rapidly identify and eliminate known files, such as operating system and application files.
Noise Words Common terms purposefully excluded from a searchable index to conserve storage space and improve performance. Also known as “stop words.”
Normalization The process of reformatting data to a standardized form, such as setting the date and time stamp of files to a uniform time zone or converting all content to the same character encoding. Normalization facilitates search and data organization.
OCR or Optical Character Recognition The use of software to identify alphanumeric characters in static images (e.g., TIFF or PDF files) to facilitate text extraction and electronic search. OCR programs typically create matching text files that are used for text search with the accompanying images.
Processing Encompasses the steps required to extract text and metadata from information items and to build a searchable index. ESI processing tools perform five common functions: (1) decompress, unpack and fully explore (i.e., recurse) ingested items; (2) identify and apply templates (filters) to encoded data to parse (interpret) contents and extract text, embedded objects, and metadata; (3) track and hash items processed, enumerate and unitize all items, and track failures; (4) normalize and tokenize text and data and create an index and database of extracted information; and (5) cull data by file type, date, lexical content, hash value, and other criteria.
Recursion The mechanism by which a processing tool explores, identifies, unpacks and extracts all embedded content from a file, repeating the recursive process as many times as needed to achieve full extraction.
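Recursion over nested containers can be sketched with Python's standard `zipfile` module. This example handles only ZIP files held in memory; real processing tools recurse through many container and compound formats. The names `explode` and `make_zip` are illustrative.

```python
import io
import zipfile

def explode(name, data):
    """Recursively unpack ZIP containers, yielding (path, bytes) for each leaf file."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for member in zf.namelist():
                if member.endswith("/"):      # skip directory entries
                    continue
                # Repeat the same process on each extracted item.
                yield from explode(f"{name}/{member}", zf.read(member))
    else:
        yield name, data

def make_zip(files):
    # Helper used only to build the nested sample archive below.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for fname, content in files.items():
            zf.writestr(fname, content)
    return buf.getvalue()

inner = make_zip({"memo.txt": b"inner memo"})
outer = make_zip({"inner.zip": inner, "readme.txt": b"hello"})
extracted = dict(explode("outer.zip", outer))
for path in extracted:
    print(path)
```

Here the container `inner.zip` does not appear among the results; only its fully extracted content does, mirroring how container files are suppressed once exploded.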
Request for Comment (RFC) The longstanding informal circulation of proposed protocols and standards among computer scientists, engineers and others interested in the development of the Internet and other networks. RFCs define the structure of email messages and attachments for transmission via the Internet.
SHA Secure Hash Algorithm (SHA) (SHA-1, SHA-256) is a family of cryptographic hash algorithms used for file authentication, file exclusion (DeNISTing) and data deduplication.
SMS Short Message Service (SMS) is a communication protocol that enables mobile devices to exchange text messages up to 160 characters in length.
Solr An open source content extraction, indexing and text search tool used by a number of software providers for indexing and keyword search. It is based on the Lucene open source search engine library project.
Stop Words Common terms purposefully excluded from a searchable index to conserve storage space and improve performance. Also known as “noise words.”
System Files The program and driver files crucial to the overall function of a computer’s operating and file systems. Because system files are not user-created, they may be excluded from a collection of potentially responsive data by deNISTing.
Targeted Collection A technique used to reduce overcollection of ESI by marshaling potentially responsive data based on data characteristics (such as file type, date, folder location or keyword search) as opposed to duplicating the entire contents of a storage device (e.g., by imaging).
Threading Collection and organization of messaging as a chronologically ordered conversation.
Tika An open-source toolkit for extracting text and metadata from over one thousand file types, including most encountered in e-discovery. Tika was a subproject of the open-source Apache Lucene project. Lucene is an indexing and searching tool at the core of several commercial e-discovery applications.
Time Zone Normalization The recasting of time values of ESI–particularly of e-mail collections–to a common temporal baseline, often Coordinated Universal Time (UTC) or another time zone the parties designate.
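A minimal Python sketch of recasting a timestamp to UTC follows. The fixed UTC-05:00 offset and the sample date are assumptions for illustration; real collections require proper time zone rules, including daylight saving time.

```python
from datetime import datetime, timedelta, timezone

def to_utc(local_time: datetime) -> datetime:
    """Convert an offset-aware timestamp to UTC."""
    if local_time.tzinfo is None:
        raise ValueError("timestamp needs an offset before it can be normalized")
    return local_time.astimezone(timezone.utc)

eastern = timezone(timedelta(hours=-5))   # fixed UTC-05:00 offset, for illustration
sent = datetime(2022, 1, 27, 9, 30, tzinfo=eastern)
print(to_utc(sent).isoformat())  # 2022-01-27T14:30:00+00:00
```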
Tokenization A method of document parsing that identifies words (“tokens”) to be used in a full-text index. Because computers cannot read as humans do but only see sequences of bytes, computers employ programmed tokenization rules to identify character sequences that constitute words and punctuation.

Western languages typically use spaces and punctuation to identify word (or token) breaks. Because other languages, e.g., Chinese, Japanese and Korean, do not use these methods to break characters into words, language-specific tokenization software is needed to ensure that words and other tokens are indexed properly for search.
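The effect of a tokenization rule is easy to see with a simple example. This Python sketch treats any run of letters, digits or underscores as a token, which is only one possible rule; note how it splits hyphenated and contracted words, which in turn determines what a search can match.

```python
import re

def tokenize(text):
    """Treat each run of word characters as a token and lowercase it."""
    return [token.lower() for token in re.findall(r"\w+", text)]

print(tokenize("E-discovery isn't simple."))
# ['e', 'discovery', 'isn', 't', 'simple']
```

A rule this simple would not segment Chinese, Japanese or Korean text into words, since those scripts do not separate words with spaces.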
Unicode An international, multibyte encoding scheme for text, symbols, emoji and control codes. Unicode 13.0 covers 154 scripts (writing systems) comprising 143,859 characters. Unicode was developed to overcome the limits of the single-byte ASCII encoding scheme that lacked the capacity to encode foreign language characters and other symbols needed for international writing and communication. Unicode is now the standard for Western and international text encoding.
Unicode Normalization Improves search recall by adding information to an index that locates Unicode characters encoded in multiple ways when searching with any counterpart encoding. Linguistically identical characters encoded in Unicode (so-called “canonical equivalents”) may be represented by different numeric values by virtue of accented letters having both precomposed (é) and composite references (e + ◌́). Unicode normalization replaces equivalent sequences of characters so that any two texts that are canonically equivalent will be reduced to the same sequence of searchable code called the “normalization form” or “normal form” of the original text.
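The precomposed and composite forms of é described above can be compared in Python using the standard `unicodedata` module. This sketch uses normalization form NFC; which form an indexing tool actually applies is not stated here and may vary.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
composite = "e\u0301"    # e followed by a combining acute accent

# The two strings look identical but are stored as different code points.
print(precomposed == composite)   # False

# After normalization to the same form, they compare equal.
nfc = unicodedata.normalize("NFC", composite)
print(nfc == precomposed)         # True
```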
UTF-8 Unicode Transformation Format, 8-bit (UTF-8) is the most widely used Unicode encoding, employing one byte for standard English letters and symbols (making UTF-8 backwards compatible with ASCII), two bytes for additional Latin and Middle Eastern characters, and three bytes for most Asian characters. Additional characters, such as most emoji, are represented using four bytes.
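The variable byte lengths can be observed by encoding sample characters in Python. The characters chosen below are illustrative examples of each length.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode the given character."""
    return len(ch.encode("utf-8"))

for ch in ["A", "é", "א", "中", "😀"]:
    print(ch, utf8_length(ch))
# A 1, é 2, א 2, 中 3, 😀 4
```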

This article has not been revised since publication.

This post was created by JenW on January 27, 2022.
