Optical character recognition (OCR) is the process of extracting text from images during ingest. DISCO examines each line in an image and attempts to determine whether the black and white dots represent a letter or number. Anything recognized as text is then placed in a PDF.
DISCO does not OCR small images (less than 100 x 100 pixels) or non-text elements on a page, such as text drawn using vector graphics.
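The exclusion rule above can be pictured as a simple filter. The following is only an illustrative sketch, not DISCO's actual implementation: the `PageImage` type, its fields, and the `should_ocr` function are hypothetical names, and the reading of "less than 100 x 100 pixels" as "both dimensions under 100" is an assumption.

```python
from dataclasses import dataclass

# Threshold taken from the text above; everything else here is hypothetical.
MIN_DIMENSION_PX = 100


@dataclass
class PageImage:
    """Hypothetical description of one element found on a page."""
    width: int
    height: int
    is_vector_text: bool = False  # e.g. text drawn using vector graphics


def should_ocr(image: PageImage) -> bool:
    """Return True if the element would be a candidate for OCR."""
    # Non-text elements such as vector-drawn text are skipped.
    if image.is_vector_text:
        return False
    # Assumption: "less than 100 x 100" means both dimensions are under 100.
    if image.width < MIN_DIMENSION_PX and image.height < MIN_DIMENSION_PX:
        return False
    return True
```

Under these assumptions, a 99 x 99 thumbnail is skipped, while a 100 x 100 or larger raster image is passed on to OCR.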
OCR is used in combination with the language identification module to identify non-English languages. The process begins with the language identification module interrogating the text of each page (which is stored in Unicode format) within a document and determining which languages the text represents. Then the OCR module examines the images on each page to determine which scripts of the supported languages are present. When a supported script is identified, a sample OCR pass runs in that script to extract text.
The language identification module then runs again, this time on the sampled OCR text. Once every page within a document has been evaluated, the system makes a final determination of the document's language. If an identified language is not within the set of supported languages, it is designated as “undetermined”.
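As a rough illustration of the final determination step, the sketch below combines per-page language guesses into one document-level result. It is not DISCO's actual logic: the text does not say how the page results are combined, so the majority vote is an assumption, and the `SUPPORTED_LANGUAGES` set and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical set of supported languages; the real list is not given here.
SUPPORTED_LANGUAGES = {"en", "es", "fr", "de"}
UNDETERMINED = "undetermined"


def determine_document_language(page_languages, supported=SUPPORTED_LANGUAGES):
    """Pick one language for a document from per-page language guesses.

    Assumption: the most frequent page-level language wins (majority vote).
    Pages with no guess (None or empty string) are ignored.
    """
    votes = Counter(lang for lang in page_languages if lang)
    if not votes:
        return UNDETERMINED
    top_language, _count = votes.most_common(1)[0]
    # A language outside the supported set is reported as "undetermined".
    return top_language if top_language in supported else UNDETERMINED
```

For example, under these assumptions a document whose pages were identified as `["fr", "fr", "en"]` would resolve to `"fr"`, while one dominated by an unsupported language would resolve to `"undetermined"`.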
Finally, OCR is performed on every image on every page of the document in the identified language. The text extracted with OCR is then parsed into words using a tokenizer associated with the identified language. When this document is used in predictive tagging, it will be used as a model of the identified language.
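The idea of a tokenizer associated with each language can be sketched as a lookup table from language code to tokenizer function. This is a simplified illustration only: the registry, the function names, and the two tokenization strategies (regex word matching and per-character splitting) are assumptions, not a description of the tokenizers DISCO uses.

```python
import re


def tokenize_words(text):
    """Split on runs of word characters; suits space-delimited languages."""
    return re.findall(r"\w+", text)


def tokenize_characters(text):
    """Treat each letter or digit as a token; a crude stand-in for
    languages written without spaces between words."""
    return [ch for ch in text if ch.isalnum()]


# Hypothetical registry mapping a language code to its tokenizer.
TOKENIZERS = {
    "en": tokenize_words,
    "fr": tokenize_words,
    "zh": tokenize_characters,
}


def tokenize(text, language):
    """Parse OCR text into tokens using the tokenizer for `language`.

    Assumption: languages without a registered tokenizer fall back to
    simple word splitting.
    """
    tokenizer = TOKENIZERS.get(language, tokenize_words)
    return tokenizer(text)
```

The dispatch table keeps the choice of tokenizer separate from the OCR step, so the same extracted text can be parsed differently depending on the language determined for the document.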