What is OCR or Optical Character Recognition?

[Editor’s Note: EDRM is proud to support our Guardian Plus Partner, Zapproved, in their educational efforts.]

Optical character recognition, or OCR, is technology that recognizes letters, numbers, and other written characters, which lets it convert images or scanned paper documents into searchable electronic text. The basic process involves examining the text of a document and translating the characters into code that can be used for data processing. OCR is sometimes also referred to as text recognition.

OCR systems are usually made up of a combination of hardware and software that is then used to convert physical documents into machine-readable text. The hardware, such as an optical scanner or specialized circuit board, is used to copy or read text while the software typically handles the advanced processing.

Surprisingly, OCR was first used over 100 years ago in the Optophone, a reading device for the blind that translated letters into sounds. Today, OCR is much more advanced, using artificial intelligence to identify characters. This allows it to recognize letters even when they appear remarkably different. It can, for instance, identify the letters “a” and “g” despite dramatic differences in their structures across different common fonts. In some applications — common in mail sorting, but not in ediscovery — OCR can even be used to decipher handwriting.

How Does OCR Work with Ediscovery?

In ediscovery, OCR is used to convert electronic or paper-based discovery into computer-based text. When discoverable materials are received as images such as TIFF files, software with OCR can identify any letters, numbers, or other text characters. It then converts those characters from pixel-based pictures into readable text.

Similarly, OCR can be used when discovery involves physical documents such as typed letters or printouts. Those physical pages can be scanned, processed through an OCR system, and rendered as computer files in a fraction of the time it would take a human to read them.

The OCR process is complex and typically starts with pre-processing. OCR engines work with a variety of inputs, from photos to book pages to receipts, and perform a number of pre-processing steps to normalize this inbound data. These steps include deskewing (rotating the page image so that the text is squarely aligned, whether horizontally or vertically), removing lines and spots, and analyzing the structure of the text.
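To make one of these clean-up steps concrete, here is a minimal sketch of spot removal. It assumes the page has already been reduced to a grid of 0s (blank) and 1s (ink), and it treats any ink pixel with no inked neighbor as a speck. The function name `remove_specks` and the sample page are illustrative only; real OCR engines use far more sophisticated filters.

```python
def remove_specks(image):
    """Return a copy of a 0/1 pixel grid with isolated ink pixels cleared.

    Assumption for this sketch: a "speck" is an ink pixel (1) that has
    no ink pixel among its eight surrounding neighbors.
    """
    height = len(image)
    width = len(image[0]) if image else 0

    def is_ink(row, col):
        # Pixels outside the page count as blank.
        return 0 <= row < height and 0 <= col < width and image[row][col] == 1

    cleaned = []
    for r in range(height):
        new_row = []
        for c in range(width):
            has_neighbor = any(
                is_ink(r + dr, c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            new_row.append(1 if image[r][c] == 1 and has_neighbor else 0)
        cleaned.append(new_row)
    return cleaned


# Hypothetical sample: a stray speck in the top-left corner
# and a vertical pen stroke in the middle column.
page = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
cleaned_page = remove_specks(page)
```

In this sample the corner speck is cleared while the stroke survives, because each pixel of the stroke touches another inked pixel.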

Once pre-processing is complete, OCR recognizes characters using one of two methods: feature extraction or pattern matching. The former uses machine learning to develop a nuanced understanding of the features that might define each character, with accuracy of up to 99%. The latter compares each character pixel by pixel with a library of stored character images in search of a match and is generally the older of the two approaches.
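The pattern-matching approach can be illustrated with a toy example. The sketch below assumes each character has already been isolated and scaled to a tiny 3x5 grid, and it uses a made-up three-letter template library. The names `TEMPLATES`, `count_matching_pixels`, and `recognize` are hypothetical and do not come from any particular OCR product.

```python
# Hypothetical template library: "#" is ink, "." is blank.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def count_matching_pixels(glyph, template):
    """Count the positions where the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return (best matching character, fraction of pixels that agree)."""
    best = max(TEMPLATES, key=lambda ch: count_matching_pixels(glyph, TEMPLATES[ch]))
    total_pixels = sum(len(row) for row in glyph)
    return best, count_matching_pixels(glyph, TEMPLATES[best]) / total_pixels


# A scanned "T" with one pixel of the stem missing.
noisy_t = ("###", ".#.", ".#.", "...", ".#.")
character, score = recognize(noisy_t)
```

Here the damaged glyph still agrees with the "T" template on 14 of 15 pixels, more than with any other template, so it is read as "T". The same example hints at why this method struggles with unfamiliar fonts: a character is only recognized if something close to it is already in the library.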

OCR for Image and Paper-Based Discovery

The primary advantage of OCR is that electronic text, unlike images or paper, is fully searchable. This eliminates the need for human review teams to laboriously flip through stacks of paper discovery. Instead, a legal team can scan documents, use OCR to generate text files, and then search those files for keywords, names, dates, and any other text-based content. This significantly reduces the amount of time required to process or review image- or paper-based discovery. As a result, OCR can streamline and simplify discovery, allowing for better early case assessment (ECA). It can also drastically lower costs, particularly within the review phase.
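As a rough illustration of the search step, the sketch below assumes OCR has already produced plain text for each document and that the text is held in a dictionary keyed by document ID. The document IDs, sample text, and the function name `search_documents` are invented for the example; review platforms typically use indexed search rather than a simple scan like this.

```python
import re


def search_documents(documents, keywords):
    """Map each keyword to the sorted IDs of the documents that contain it.

    Matching is case-insensitive and on whole words only.
    """
    hits = {}
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        hits[keyword] = sorted(
            doc_id for doc_id, text in documents.items() if pattern.search(text)
        )
    return hits


# Hypothetical OCR output for three scanned pages.
ocr_text = {
    "DOC-001": "Memo dated 12 March 2019 regarding the Acme supply contract.",
    "DOC-002": "Invoice for office furniture. No contract attached.",
    "DOC-003": "Meeting notes: J. Smith to follow up with Acme.",
}
results = search_documents(ocr_text, ["contract", "Acme", "Smith"])
```

With this sample data, "contract" is found in DOC-001 and DOC-002, "Acme" in DOC-001 and DOC-003, and "Smith" in DOC-003 only.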

Unlike paper-based documents, electronically stored information (ESI) can be edited, copied, and distributed almost instantaneously.

Digitizing paper documents also obviates the need for physical storage space or file organization. Boxes of paper discovery are susceptible to damage or destruction through fire, flooding, or other natural disasters, whereas OCR allows discoverable information to be extracted as text and stored electronically. It also reduces the risk of misplacing a specific piece of paper or wasting time locating a particular document.

The Realities of Using OCR in Ediscovery Today

Of course, OCR has its shortcomings for physical documents. Despite considerable advances, it is still not 100 percent accurate, which can lead to misspelled words and missed search terms. Nonetheless, OCR may eventually render paper-based discovery entirely obsolete, as what remains on paper can be rapidly ingested for processing as ediscovery. And for imaged documents, OCR offers the ability to review TIFFs and PDFs as easily as other types of ESI.
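The missed-search-term problem is easy to reproduce. In the sketch below, a hypothetical line of OCR output has a zero in place of the letter "o", so an exact search for "contract" finds nothing. The approximate comparison shown afterward, built on Python's standard `difflib` module, is one possible way to catch near misses. It is an assumption added for illustration rather than a technique described in this article, and the function name `fuzzy_hits` and the 0.8 threshold are arbitrary choices.

```python
from difflib import SequenceMatcher


def fuzzy_hits(text, term, threshold=0.8):
    """Return the words in text whose similarity to term meets the threshold."""
    hits = []
    for word in text.split():
        candidate = word.strip(".,;:!?()\"'").lower()
        if not candidate:
            continue
        similarity = SequenceMatcher(None, candidate, term.lower()).ratio()
        if similarity >= threshold:
            hits.append(candidate)
    return hits


# Hypothetical OCR output with two misread words.
ocr_line = "The parties signed the c0ntract on rnonday."

exact_match_found = "contract" in ocr_line        # the exact search misses it
near_matches = fuzzy_hits(ocr_line, "contract")   # the approximate search does not
```

A looser threshold catches more OCR errors but also returns more false positives, so any setting like this would need to be tuned and validated for the matter at hand.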

Interested in using OCR during ediscovery? Look for software that includes OCR as part of the ESI data processing and review tools.

Glossary definition

OCR or optical character recognition is a technology that can identify letters, numbers, and other characters, converting images or scanned paper documents into searchable electronic text.

Author

Mary Mack

Mary Mack is the CEO and Chief Legal Technologist for EDRM. Mary was the co-editor of the Thomson Reuters West Treatise, eDiscovery for Corporate Counsel for 10 years and the co-author of A Process of Illumination: the Practical Guide to Electronic Discovery. She holds the CISSP among her certifications.
