Advice on how to extract info from multiple PDFs
by Solidsnake23 - Saturday April 5, 2025 at 10:14 PM
#11
(04-08-2025, 02:18 PM)HassaMassa Wrote: Seems most people are gonna keep sending you info on how to convert to text, despite you already saying you've done that.

It's not the easiest job to do, depending on what info you're wanting to extract. Like you said, stuff can be put on different lines etc.

You need to find patterns in how the text is presented in the files. Search for common prefixes and put a label on each one, so you can filter out other data and keep only what you need.
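A rough sketch of the prefix-and-label idea in Python, assuming the PDFs are already converted to plain text. The prefixes ("Invoice No:", "Date:", "Total:"), the label names and the sample text are all made up for illustration; swap in whatever actually recurs in your own files.

```python
# Hypothetical prefixes mapped to labels; replace with the ones in your files.
PREFIXES = {
    "Invoice No:": "invoice_no",
    "Date:": "date",
    "Total:": "total",
}

def label_lines(text):
    """Return (label, value) pairs for lines that start with a known prefix."""
    found = []
    for line in text.splitlines():
        stripped = line.strip()
        for prefix, label in PREFIXES.items():
            if stripped.startswith(prefix):
                # Keep only what follows the prefix; all other lines are dropped.
                found.append((label, stripped[len(prefix):].strip()))
                break
    return found

# Made-up sample of converted PDF text.
sample = """ACME Ltd
Invoice No: 1042
some unrelated line
Date: 2025-04-05
Total: 99.50
"""

for label, value in label_lines(sample):
    print(label, "=", value)
```

Lines without a known prefix are simply ignored, which is the filtering part. If the same field shows up under slightly different prefixes across files, adding more entries to the mapping is probably easier than writing one clever pattern.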

Use regexes to help where possible: ones that can target the first instance of a character or string, ones that can target data on every 6th line, etc.
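Something like the following could cover both cases, as a sketch only. The helper names (first_after, every_nth_line) and the sample text are hypothetical. Note that the every-6th-line part is done with plain list slicing instead of a regex, which is a swap on my side because it is easier to get right when every record has the same number of lines.

```python
import re

def first_after(text, marker):
    """Return the rest of the line after the FIRST occurrence of marker, or None."""
    # re.escape so characters like ':' or '.' in the marker are taken literally.
    # [ \t]* skips spaces/tabs only, so the match never runs onto the next line.
    match = re.search(re.escape(marker) + r"[ \t]*(.*)", text)
    return match.group(1).strip() if match else None

def every_nth_line(text, n=6):
    """Return every n-th line (the 6th, 12th, ... when n=6), counting from 1."""
    return text.splitlines()[n - 1::n]

# Made-up sample: the marker appears twice, only the first hit is returned.
text = "Name: Bob\nCity: Oslo\nName: Alice"
print(first_after(text, "Name:"))

# Made-up sample: twelve numbered lines, two records of six lines each.
block = "\n".join(f"line {i}" for i in range(1, 13))
print(every_nth_line(block, 6))
```

The slicing only works if the layout really is fixed. If some records have an extra or a missing line, everything after that point shifts, so it is worth spot-checking the output against a few source files.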

Another tip, though irritating, is to really try to place the PDF files into groups before you convert / OCR to text. That way, you know all the files in one folder will be in the same format.

It's not the easiest job; you might need to use several methods to get everything out.

Damn man, I know who to pester for advice like this. Superb!
#12
I personally automate all this stuff with Python; just ask GPT to do it if you're too stupid to write it yourself.
#13
(04-08-2025, 11:21 AM)corvenik Wrote: I used this sometimes.
https://github.com/VikParuchuri/pdftext

Good tip, thanks!
#14
Thank you all for your advice.

I was going to create a Python script with lots of regexes to extract the data I needed, then I found this tool: https://github.com/DocumindHQ/documind
It can extract data even from PDFs that are images (using OCR), all automated if you're lazy.

Plenty of potential in this one.