Applies to version: 2021.1.x and above; author: Daniel Półchłopek
Introduction
In WEBCON BPS, you can search the content of PDF attachments that contain a text layer. The layer is generated by ABBYY FineReader, a program that converts scanned documents and PDF files into an editable form. This allows users to easily convert documents as well as search inside files.
The FineReader installation procedure is described on our technical blog: FineReader 11 installation.
This article shows how to configure searching for documents by their content, using two example workflows.
The “Add a text layer” action also supports .jpg and .png files. Graphic files are converted to PDF format, and then the text layer is applied. The effectiveness of such an operation depends on the quality of the image file. Searching in the content of files is also possible for .txt and .docx formats.
First example
The workflow is used to manage archived documents. Employees sort documents that have been used in workflows by placing them in the appropriate folder. To make this work easier, a text layer can be applied to scanned documents, so that a user can quickly find a file by entering a phrase in the system. The user starts the workflow, which can be related to, for example, a specific document or folder, and prints a barcode. The file is then scanned, sent to HotFolder in PDF format, and attached to the related instance based on the code on the sticker.
In the next step, the text layer is applied to this attachment so that you can search for it later. The search is based on the SOLR search engine, which lets the user narrow down the search and get more relevant results. See: Searching structure in WEBCON BPS Portal.
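To illustrate what a phrase search looks like at the level of a SOLR query in general, here is a minimal sketch. It is not how WEBCON BPS issues its queries, which this article does not describe; the host, the core name `example_core` and the field name `content` are hypothetical. Only the standard `/select` handler and the quoted-phrase syntax of the SOLR query parser are assumed.

```python
from urllib.parse import urlencode


def build_phrase_query(base_url, field, phrase, rows=10):
    """Build a SOLR select URL that searches one field for an exact phrase."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "rows": rows}
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, used for illustration only.
url = build_phrase_query(
    "http://localhost:8983/solr/example_core",
    "content",
    "Warranty rules for painting services",
)
print(url)
```

Quoting the phrase asks the engine for the words in that exact order, which is one way a search can be narrowed to more relevant results.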
Configuration of additional devices and elements:
The workflow consists of several steps:
Fig. 1. Workflow diagram – example 1
On the “Overlay the text layer” path in the “Wait for scan” step, add the “Add a text layer” action and configure it as shown in the screenshot below.
Fig. 2. The configuration of the “Add a text layer” action
Output file resolution (DPI) - the number of image points per inch. It defaults to the value of the source file,
Image layer quality (%) - the parameter is set to 90% by default,
Output file format - specifies the file format that will be the result of the action execution,
Text layer language – defines the language in which text recognition will be performed. If the language of the document(s) is known, it is highly recommended to set it as the text layer language, as this greatly improves the accuracy of text recognition. Selecting the wrong language may result in diacritical marks being ignored or recognized incorrectly. If the documents are in Russian, Ukrainian or Hebrew, it is recommended to additionally set English as a language, so that any Latin alphabet characters are recognized correctly. When the document’s language is not known, automatic mode is the recommended setting, although it may result in lower-quality text recognition and increased processing time.
Mode – defines in what form an attachment with the text layer is to be attached to the instance. There are three possible options:
Priority - determines the urgency with which the workflow instance is put into the queue. Priority can be assigned a value from 1 to 10, where 1 is the highest priority and 10 is the lowest. Checking the "Nighttime" box puts instances into the queue with the absolute lowest priority of 11, and the action processes these instances only after working hours (during the night).
Error handling – determines what happens when a queue element encounters an error for the first time. There are two possible options: "Wait for user decision" and "Retry operation automatically".
If you change the setting from "Wait for user decision" to "Retry operation automatically", the processing of all queued elements whose attempt counter is less than 5 will resume. This only applies to elements that were queued by the action that is being modified (elements queued in the same queue by other actions will not be affected).
If you change the setting from "Retry operation automatically" to "Wait for user decision", the processing of all pending elements whose attempt counter is greater than 5 will stop. This only applies to elements that were queued by the action that is being modified (elements queued in the same queue by other actions will not be affected).
Reloading service configuration is recommended after changing this setting.
Input file types - specifies the file type of source attachments,
Filter by regular expression - a field that allows you to enter a regular expression for selecting files for which the text layer will be generated,
Category - defines the category of source attachments. There are three possible options:
PDF attachments to which a text layer is to be applied are added to the queue and analyzed according to the value of the priority assigned to them. After applying a text layer, the system moves the modified PDF document to the final step and its content can be searched.
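The queue behavior described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this article (priority 1 to 10, the "Nighttime" priority of 11, and the attempt counter threshold of 5), not the actual WEBCON BPS service code. The `QueueElement` class, the `sequence` tie-breaker and the `working_hours` flag are hypothetical names introduced for the example.

```python
import heapq
from dataclasses import dataclass, field

NIGHTTIME_PRIORITY = 11  # lowest priority, used when "Nighttime" is checked
ATTEMPT_THRESHOLD = 5    # attempt counter value named in the error handling rules


def effective_priority(priority, nighttime=False):
    """Map the action settings to a queue priority (1 = highest, 10 = lowest)."""
    if nighttime:
        return NIGHTTIME_PRIORITY
    if not 1 <= priority <= 10:
        raise ValueError("priority must be between 1 and 10")
    return priority


@dataclass(order=True)
class QueueElement:
    priority: int
    sequence: int  # hypothetical tie-breaker: insertion order
    name: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


def processing_order(elements, working_hours):
    """Return file names in pick-up order; nighttime elements wait until night."""
    heap = [e for e in elements
            if not (working_hours and e.priority == NIGHTTIME_PRIORITY)]
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]


def affected_by_mode_change(element, new_mode):
    """Apply the two rules quoted in the error handling description."""
    if new_mode == "Retry operation automatically":
        return element.attempts < ATTEMPT_THRESHOLD  # processing resumes
    if new_mode == "Wait for user decision":
        return element.attempts > ATTEMPT_THRESHOLD  # processing stops
    raise ValueError(f"unknown mode: {new_mode}")


queue = [
    QueueElement(effective_priority(5), 0, "invoice.pdf"),
    QueueElement(effective_priority(1), 1, "urgent.pdf", attempts=6),
    QueueElement(effective_priority(3, nighttime=True), 2, "archive.pdf"),
]
print(processing_order(queue, working_hours=True))   # ['urgent.pdf', 'invoice.pdf']
print(processing_order(queue, working_hours=False))  # ['urgent.pdf', 'invoice.pdf', 'archive.pdf']
```

Note that, with the rules exactly as worded, an element whose attempt counter equals 5 is affected by neither setting change; the sketch preserves that rather than guessing the intended boundary.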
Please note that the text layer can only be placed on a file that has not been protected with a password.
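A pattern for the "Filter by regular expression" field can be tried out before it is entered in the action configuration. The sketch below assumes that the expression is matched against the attachment file name and that matching is case-insensitive; the article states neither, and the file names are made up. The pattern uses only basic syntax that is shared by most regular expression engines.

```python
import re


def select_files(file_names, pattern):
    """Return the file names that match the regular expression."""
    # Case-insensitive matching is an assumption made for this example.
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in file_names if regex.search(name)]


# Hypothetical attachment names.
attachments = [
    "scan_0001.pdf",
    "scan_0002.PDF",
    "notes.txt",
    "photo.png",
    "draft_scan.pdf.bak",
]

# Only files named "scan_<digits>.pdf" are selected.
selected = select_files(attachments, r"^scan_\d+\.pdf$")
print(selected)  # ['scan_0001.pdf', 'scan_0002.PDF']
```

Anchoring the pattern with `^` and `$` avoids accidental matches such as backup copies whose names merely contain ".pdf".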
Application example
The form with the scanned PDF document is shown below.
Fig. 3. Registration of a new instance
On the main page of WEBCON BPS Portal, enter the phrase “…form at the cost invoice registration step”, which appears in one of the documents. The instance containing this document will appear in the search results.
Fig. 4. Search for an instance based on a fragment of its content
Second example
The second example presents the “Add a text layer” action in the full OCR workflow for invoices. The configuration of this process is described in the article The OCR verification view for the MODERN form.
Fig. 5. Invoice workflow
Attachments added to the invoice are not processed by OCR - only the text layer is applied to them.
The example uses a collective invoice for renovation and construction services. The warranty rules for these services were added as an attachment to the invoice.
Fig. 6. View of the considered workflow form
After going through the "Waiting for text layer" step, it is possible to search by the content of the PDF attachment using the search engine. On the main page of WEBCON BPS Portal, enter the phrase “Warranty rules for painting services”, which appears in one of the documents. The instance containing this document will appear in the search results.
Fig. 7. Search for an instance based on a fragment of its content
Summary
The article presents two examples of using the "Add a text layer" action in practice. The action speeds up searching the content of PDF documents: instead of looking through files manually, the user can enter any phrase in the search engine, and the system will find the document in which it appears. This makes the employee's work easier and more efficient.
Useful links
[1] OCR verification view for the MODERN form
[2] Installation of a barcode printer
[3] HotFolder - attaching scanned files to the process