PDF - Recognize - Cut - Extract - send - update EspoCrm

**emillod** · 02-17-2023, 10:51 AM

Hello Sir,
Is this actually related to EspoCRM, or is it marketing for your application? As you write in your post, it's not even connected to EspoCRM yet, because the update is still on your to-do list ("Update ESPOCRM (to do)").
I watched your video and there is nothing about EspoCRM.

**enricorossa** · 02-17-2023, 10:56 AM

Hello, no, it's not marketing: the application is not for sale and will be available in a free version. If you look at the bottom of my post, I wrote:

" If you are interested in helping me design a plugin to update espo, please contact me
or you know an existing module to adapt please contact me"

I'd just like to understand whether someone might be interested in such an app, and how to update Espo: is there already a module that takes JSON and processes it to update the Espo entities?

I apologize for my bad presentation and for my English.

If my post was not appropriate, I will delete it.

**emillod** · 02-17-2023, 11:15 AM

enricorossa no problem. I just wanted to clarify this matter.

Don't worry

**esforim** · 02-20-2023, 05:03 AM

Hi there,

Looking at the video, I can see what its purpose is and what you are trying to do. And it is awesome!

I have been looking for something like this for quite a while. The closest things I found are tools that are hard to find, or whose server requirements are too demanding for a shared hosting setup. Hopefully that is not the case here.

For those who don't understand: it basically extracts data from a PDF into a text format.

Think of it as PDF-to-Text Field

Exporting a whole PDF to text is always possible and not hard. Exporting only certain key data is the difficult part, and I haven't seen any software that does that yet, not even offline/desktop software. At least I can't find one.

I'm looking forward to seeing the result of this project, as your video doesn't show what Result.txt looks like. One feature I'm hoping for is export into fields, or perhaps CSV or some other format that makes it easier to import the extracted information or send it through the REST API.

Secondly, is it possible to extract by keyword instead of by "geographic" location? For example, it would search for "Date:" and take all the data after it, stopping after a certain number of characters, at the end of the line, etc. Might this be more difficult to do?
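To make the keyword idea concrete, here is a minimal Python sketch that works on text already exported from a PDF. It is not part of enricorossa's app; the function name `extract_after_keyword` and its parameters `max_chars` and `stop_word` are hypothetical and only illustrate the "stop after N characters / at the line end / at another word" behaviour described above.

```python
import re

def extract_after_keyword(text, keyword, max_chars=None, stop_word=None):
    """Return the text that follows `keyword` on the same line.

    Optionally cut the value at `stop_word` and/or after `max_chars` characters.
    Returns None when the keyword does not occur in the text.
    """
    # Capture everything after the keyword up to (not including) the line end.
    match = re.search(re.escape(keyword) + r"[ \t]*([^\r\n]*)", text)
    if match is None:
        return None
    value = match.group(1)
    if stop_word is not None:
        idx = value.find(stop_word)
        if idx != -1:
            value = value[:idx]
    if max_chars is not None:
        value = value[:max_chars]
    return value.strip()

# Made-up sample text standing in for the output of a PDF-to-text export.
sample = "Invoice no: 42\nDate: 17/02/2023 Due: 17/03/2023\nTotal: 100.00"

print(extract_after_keyword(sample, "Date:"))                   # rest of the line
print(extract_after_keyword(sample, "Date:", max_chars=10))     # first 10 characters
print(extract_after_keyword(sample, "Date:", stop_word="Due:")) # up to another word
```

A keyword search like this only works when the label text is stable across documents, which is one reason a coordinate-based rule can still be useful alongside it.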

I don't think there is anything similar in EspoCRM yet.

Some other similar projects (free, paid, or closed source) are as follows:

GitHub - lgmarin/pdf_data_extract: Extract data from PDF file with Python

https://github.com/lgmarin/pdf_data_extract

Extract data from PDF file with Python. Contribute to lgmarin/pdf_data_extract development by creating an account on GitHub.

How to Extract Data from PDFs Using Different Methods?

https://nanonets.com/blog/extract-data-from-pdf/

PDF data extraction is a common problem faced by organizations. This article covers 6 popular ways to extract data from PDF files in 2024.

How To Extract Data From PDFs

https://docparser.com/blog/extract-data-from-pdf/

The PDF is here to stay. In today’s work environment, the PDF became ubiquitous as a digital replacement for paper and holds a variety of important

**enricorossa** · 02-20-2023, 08:14 AM

My app lets you graphically configure the rules that generate the commands executed by poppler-utils. Using pdftotext.exe or pdftocairo.exe you can recognize PDFs, export data from them, and cut them.
Previously I had to write all the rules by hand in a file that I then ran from the command line, and I wasted a lot of time, especially when the PDF templates were modified. It was very difficult to test the commands with the right coordinates to find the position of the text to extract.
We are migrating everything from the old CRM, for which I had written a module that takes the txt and updates the data. Now the time has come to move this to EspoCRM too, and I was wondering if there is a module to adapt.

1) I haven't tested it on shared hosting yet, because there it's not possible to install poppler-utils, whose executables process the commands generated by my app.
Example of a command for extracting text:
pdftotext.exe -f 1 -l 1 -r 150 -nopgbrk -x 685 -y 122 -W 308 -H 23 spool/1/spezzati/cedolini_ydfrL7H2W2.pdf spool/1/tmp/bPxpXTENkT/bPxpXTENkT.txt
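For readers who want to script the same call, the command above could be assembled and run from Python roughly as follows. This is only a sketch, not the app's actual code: the helper names `build_pdftotext_cmd` and `extract_region` are hypothetical, and it assumes `pdftotext` from poppler-utils is on the PATH. The options used are the same ones that appear in the command above.

```python
import subprocess

def build_pdftotext_cmd(pdf_path, txt_path, page, x, y, w, h, dpi=150, exe="pdftotext"):
    """Build the argument list for cropping one rectangular area of one page."""
    return [
        exe,
        "-f", str(page), "-l", str(page),  # first and last page to convert
        "-r", str(dpi),                    # resolution used for the crop coordinates
        "-nopgbrk",                        # no page-break characters in the output
        "-x", str(x), "-y", str(y),        # top-left corner of the crop area
        "-W", str(w), "-H", str(h),        # width and height of the crop area
        pdf_path, txt_path,
    ]

def extract_region(pdf_path, txt_path, page, x, y, w, h):
    """Run pdftotext on one area and return the extracted text (needs poppler-utils)."""
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, page, x, y, w, h), check=True)
    with open(txt_path, encoding="utf-8") as fh:
        return fh.read().strip()

# Placeholder file names; only the command is printed here, nothing is executed.
cmd = build_pdftotext_cmd("input.pdf", "output.txt", page=1, x=685, y=122, w=308, h=23)
print(" ".join(cmd))
```

Passing the arguments as a list rather than one shell string avoids quoting problems when file names contain spaces.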

2) I currently use it locally and on a domain on VPS servers, and it performs very well.

3) The software exports JSON inside a txt file:

[
{
"id":33,
"idregola":10,
"pagina":1,
"x":71,
"y":128,
"w":99,
"h":16,
"variabile":"matricola",
"valore":"0000000011"
},
{
"id":34,
"idregola":10,
"pagina":1,
"x":273,
"y":125,
"w":320,
"h":25,
"variabile":"cognome",
"valore":"PINCO"
}
]
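As an illustration of how a consumer might read this export, the sketch below turns the list of records into a simple name-to-value mapping. It assumes the file content is a JSON array of objects with the `variabile` and `valore` keys shown above; the function name `values_by_variable` is hypothetical. Note that Python's `json` module is strict and rejects a trailing comma after the last object.

```python
import json

# Sample content mirroring the structure of the export shown above.
raw = """
[
  {"id": 33, "idregola": 10, "pagina": 1, "x": 71, "y": 128, "w": 99, "h": 16,
   "variabile": "matricola", "valore": "0000000011"},
  {"id": 34, "idregola": 10, "pagina": 1, "x": 273, "y": 125, "w": 320, "h": 25,
   "variabile": "cognome", "valore": "PINCO"}
]
"""

def values_by_variable(raw_json):
    """Map each extracted variable name to its extracted value."""
    records = json.loads(raw_json)
    return {rec["variabile"]: rec["valore"] for rec in records}

data = values_by_variable(raw)
print(data)
```

Values stay strings, so an identifier such as `0000000011` keeps its leading zeros.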

4) Yes, you can extract as much data as you want and process it by telling it to extract up to N characters or up to another word
(I still have to implement the code for this function, but it can be done).

The interesting thing about this app is that if you have two PDFs with the same keyword, you can tell them apart based on the geographic area where the keyword appears.

Example:
PDF1 contains the word INVOICY in the area 70, 120, 99, 16 and is an invoice.
PDF2 contains the word INVOICY in the area 80, 140, 110, 20, but it is a summary, not an invoice.
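The recognition idea in this example can be sketched in a few lines of Python. This is an illustration, not the app's implementation: it assumes the four numbers of an area are x, y, width and height, and the `classify` function, the rule dictionaries and the `read_area` callback (which would return the text found in a given area of the PDF, for instance via pdftotext) are all hypothetical.

```python
def classify(read_area, rules):
    """Return the label of the first rule whose keyword is found in its area."""
    for rule in rules:
        if rule["keyword"] in read_area(rule["area"]):
            return rule["label"]
    return None  # no rule matched: unknown document type

# Same keyword, different areas (assumed to be x, y, width, height).
rules = [
    {"keyword": "INVOICY", "area": (70, 120, 99, 16), "label": "invoice"},
    {"keyword": "INVOICY", "area": (80, 140, 110, 20), "label": "summary"},
]

# Stand-ins for real PDFs: a dict from area to the text found in that area.
fake_pdf1 = {(70, 120, 99, 16): "INVOICY"}
fake_pdf2 = {(80, 140, 110, 20): "INVOICY"}

print(classify(lambda area: fake_pdf1.get(area, ""), rules))
print(classify(lambda area: fake_pdf2.get(area, ""), rules))
```

Because each rule reads only its own area, the same keyword can identify different document types without ambiguity.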

The software recognizes them and saves the PDFs in a folder of your choice under different names. The splitting and extraction rules are keyed to a suffix of the file name and apply only to files with that suffix. This way you can set up a workflow that automates all the PDF processing: you save the file in the spool folder and forget about it, and you will find the data inserted or updated in EspoCRM.

**enricorossa** · 02-22-2023, 05:25 PM

Good morning, I'm writing the Espo module and testing the insertion. I can't figure out how to load the list of an entity's fields via the API. Currently I load a record of the entity and read the field names from it, but this way some fields, such as teams, are not loaded. Can you give me a hand? Thank you.
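One possible approach, offered as an unverified sketch rather than a confirmed answer: read the field definitions from EspoCRM's metadata instead of from a record. The code below assumes that the instance exposes a `Metadata` endpoint under `api/v1`, that the API user is allowed to read it, that API-key authentication via the `X-Api-Key` header is enabled, and that the response contains an `entityDefs` section with a `fields` map per entity. The function names and the base URL are hypothetical.

```python
import json
import urllib.request

def build_metadata_request(base_url, api_key):
    """Prepare a GET request for the metadata (assumed endpoint: api/v1/Metadata)."""
    url = base_url.rstrip("/") + "/api/v1/Metadata"
    return urllib.request.Request(url, headers={"X-Api-Key": api_key})

def field_names(metadata, entity_type):
    """List the field names defined for one entity type in a metadata dict."""
    # Assumed layout: metadata["entityDefs"][entity_type]["fields"] is a dict.
    fields = metadata.get("entityDefs", {}).get(entity_type, {}).get("fields", {})
    return sorted(fields)

def fetch_field_names(base_url, api_key, entity_type):
    """Download the metadata and return the field names (performs a network call)."""
    with urllib.request.urlopen(build_metadata_request(base_url, api_key)) as resp:
        return field_names(json.load(resp), entity_type)

# Offline demonstration with a tiny hand-made metadata fragment.
demo = {"entityDefs": {"Account": {"fields": {"name": {}, "teams": {}}}}}
print(field_names(demo, "Account"))
```

If this assumption holds for your version, link-type fields such as teams should appear in the list even when a particular record does not return them.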

**enricorossa** · 03-01-2023, 08:17 AM

I finished the module for importing into EspoCRM via the API.

A few more tweaks remain, such as:

1) fetch data from an external source via data extraction variables
2) Relations with other entities

Watch the video indicated in the first post
