Artificial intelligence (AI) advancements are blurring the lines between structured information (databases, spreadsheets, data files) and unstructured content in natural language. Consequently, there’s a growing demand for innovative systems capable of processing, analyzing, and extracting valuable insights from documents. These tasks are increasingly undertaken by so-called Large Language Models (LLMs) operating on cloud platforms like AWS, Google Cloud Platform (GCP), Microsoft, and OpenAI.
In response to these needs, we have released a new version of our solution for monitoring the quality of digital documents, DocsQuality.
This latest version is enhanced with the OCRIndex value calculation feature, enabling users to verify whether a PDF file submitted to an LLM model or a document circulation system will be correctly processed by an Optical Character Recognition (OCR) engine.
Meeting expectations: DocsQuality new feature
OCRIndex is a numerical measure indicating how well OCR (Optical Character Recognition) software can read text from electronic documents, including images or scanned writings. It considers the image quality, particularly font characteristics, and detects document defects such as compression, blurring, contrast, etc. A higher OCRIndex suggests a higher likelihood of accurate character recognition.
Method for determining the OCRIndex
To determine the OCRIndex, in addition to existing OCR tools, it was necessary to use a proprietary algorithm designed for recognizing unreadable text (printed and handwritten). The method employed in DocsQuality is used to establish the OCR index for a single page of the analyzed document. The result is indicated on a scale from 1 to 100.
OCRIndex can be understood as the percentage of text in the source document that, upon analysis, will be correctly converted into a string of characters. Thus, the optimal value is 100, for example, for documents saved in a vector format (e.g., vector PDF). However, if the OCR engine struggles to read the document, the OCRIndex will be 0.
Preparing training data
One of the fundamental problems in analyzing the quality of visually weak documents is assessing whether the OCR process has omitted fragments containing valuable information. To calculate the OCRIndex, our team used a set of selected documents containing both readable and unreadable character strings (printed and as if written by hand). Each document image was analyzed in OCR systems such as Tesseract and Document Intelligence Studio so that the analysis result included both the detected text and the areas where character strings were not identified by the OCR engine (Fig).
Then, this data underwent a visual evaluation process aimed at assessing the extent to which the document had been correctly processed by OCR. For each document, an expert assessed on a scale from 0 to 10 what percentage of the text had been correctly analyzed by both programs. The OCRIndex determined by the expert represented the average value from the most commonly used OCR systems. To support the expert’s work, DocsQuality provided additional information about areas of the document that the system considered unreadable. A key improvement to the OCRIndex is that DocsQuality can now identify text in the analyzed document that might have been missed by the OCR engine.
Comment from the CEO of Inero Software
“Among our clients and partners, we are seeing an increasing interest in the automated processing of information contained in PDF files. Such documents are often delivered either as scans or photos, which in some cases can cause errors in their analysis using OCR systems or LLMs.
Thanks to OCRIndex, users are now able to quickly and automatically verify whether a document can be subjected to automated OCR processing or whether it might instead require the supervision of a specialist. In this way, we save time and avoid errors that result from implementing documents of low visual quality into circulation.“