
Action to calculate similarity score of attachment against template

Business case: due to an organizational regulation, a specific form must be filled in and attached as a PDF to a WEBCON case. The problem is how to ensure that the PDF matches the predefined format and that it is filled in with reasonable content.
Expectation: a dedicated action that takes two attachments, a template (PDF or Word) and the actual document (PDF or Word), and calculates a similarity score from 0 (the documents differ completely) to 1.00 (the documents match perfectly). Alternatively, it could return TRUE (the document matches the template and has reasonable content) or FALSE (the document doesn't match).
I am looking for a WEBCON-integrated solution.
Any ideas?

MVP

Hi Jarosław,
I don't think there is a one-size-fits-all solution here; this would have to be developed per use case.

Is the PDF 1 page, 10 pages, 100 pages, 1k pages?
Do the pages contain text, graphics, screenshots, or scans? If there are images, how would you decide whether they are reasonable?

You might want to check the tools below, but none of them is a set-up-and-forget solution :)
* https://techcommunity.microsoft.com/blog/azure-ai-services-blog/extract-data-from-pdfs-using-form-recognizer-with-code-or-without/2214299
* https://github.com/invoice-x/invoice2data - it's not only for invoices; my approach would be to parse the PDF with this tool and then check whether the output structure matches your template structure.
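
To illustrate the "check if the output structure matches the template structure" idea, here is a minimal Python sketch. It assumes the parsing tool has already produced a plain dictionary of field names and values; the field names (`applicant_name`, `date`, `amount`) and both helper functions are hypothetical and not part of any of the tools listed above.

```python
def is_filled(value):
    """A field counts as filled if it is present and not blank."""
    return value is not None and str(value).strip() != ""


def missing_fields(template_fields, extracted):
    """Return the template fields that are absent or blank in the parsed output."""
    return [name for name in template_fields if not is_filled(extracted.get(name))]


def structure_score(template_fields, extracted):
    """Fraction (0.0 to 1.0) of template fields that are filled in."""
    if not template_fields:
        raise ValueError("template_fields must not be empty")
    missing = missing_fields(template_fields, extracted)
    return (len(template_fields) - len(missing)) / len(template_fields)


# Hypothetical template definition and parser output, for illustration only.
template_fields = ["applicant_name", "date", "amount"]
extracted = {"applicant_name": "Jan Kowalski", "date": "2024-05-01", "amount": ""}

print(missing_fields(template_fields, extracted))
print(structure_score(template_fields, extracted))
```

This only checks that the expected fields exist and are not empty; whether the content is "reasonable" would still need rules per field (date formats, value ranges and so on).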

The tools below would require coding a solution and exposing it via an API, so they are more advanced:
* https://github.com/Layout-Parser/layout-parser
* https://docs.opencv.org/4.x/d4/dc6/tutorial_py_template_matching.html
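
As a much simpler baseline than the layout and image-matching tools above, a score in the 0 to 1.00 range asked for in the original post can be computed from plain text alone. The sketch below uses `difflib.SequenceMatcher` from the Python standard library, which is a different technique from those tools. It assumes the text of both documents has already been extracted from the PDF or Word files by some other means; the function names are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse all whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity_score(template_text, document_text):
    """Word-level similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    template_words = normalize(template_text).split()
    document_words = normalize(document_text).split()
    matcher = SequenceMatcher(None, template_words, document_words, autojunk=False)
    return round(matcher.ratio(), 2)


# Assumed inputs: text already extracted from the template and the attachment.
template_text = "Name: Date: Amount:"
document_text = "Name: Jan Date: 2024 Amount: 100"
print(similarity_score(template_text, document_text))
```

Note that a correctly filled form will never score 1.00 against an empty template, because the filled-in values count as differences. The acceptance threshold would have to be calibrated on real examples.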

MVP

Hi Jaroslav

I did some tests with the AI Prompt business rule.
It turns out to do quite a nice job. I uploaded two documents and used "Check the similarity between these documents and rate it between 0 and 100.
Only return the similarity as an integer value" as the instruction. The highest rating was 95, for two identical documents.
Maybe this is good enough to try?
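
If a TRUE/FALSE result is preferred over a number, the returned integer can be compared against a threshold. Below is a minimal Python sketch of that logic, purely for illustration: it does not call WEBCON or any AI service, the function names are hypothetical, and the threshold of 80 is an assumption that would need tuning. Since identical documents scored 95 in the test above, the threshold should sit clearly below 100.

```python
import re


def parse_score(response):
    """Extract the first integer from the model's answer and clamp it to 0..100."""
    match = re.search(r"\d+", response)
    if match is None:
        raise ValueError(f"no integer found in response: {response!r}")
    return max(0, min(100, int(match.group())))


def matches_template(response, threshold=80):
    """TRUE if the similarity rating reaches the (assumed) threshold."""
    return parse_score(response) >= threshold


print(matches_template("95"))
print(matches_template("Similarity: 42"))
```

Parsing defensively is worthwhile because a model may occasionally wrap the number in extra words even when told to return only an integer.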