
DredgenSun · 04-09-2025, 09:31 AM

(04-08-2025, 02:18 PM)HassaMassa Wrote: Seems most people are gonna keep sending you info on how to convert to text, despite you already saying you've done that.

It's not the easiest job to do, depending on what info you're wanting to extract. Like you said, stuff can be put on different lines etc.

You need to find patterns in how the text is presented in the files. Search for common prefixes and put a label on each one, so you can filter out other data and keep only what you need.

Use regexes to help where possible: ones that can target the first instance of a character / string, ones that can target data on every 6th line, etc.
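A rough sketch of those regex ideas in Python, using only the standard `re` module. The sample text, the label names (`Invoice No`, `Name`, `Total`) and the helper function names are all made up for illustration; swap in whatever prefixes and line spacing your own converted files actually show.

```python
import re

# Hypothetical sample of text produced by a PDF-to-text conversion.
SAMPLE = """Invoice No: 1001
Name: Alice Example
Date: 2025-04-01
Total: 19.99
Page 1 of 1

Invoice No: 1002
Name: Bob Example
Date: 2025-04-02
Total: 5.00
Page 1 of 1
"""

# Assumed labels; use the prefixes that actually appear in your files.
LABELS = ("Invoice No", "Name", "Total")
LABEL_RE = re.compile(
    r"^(%s):[ \t]*(.*)$" % "|".join(re.escape(label) for label in LABELS),
    re.MULTILINE,
)


def labelled_fields(text):
    """Return (label, value) pairs for lines that start with a known prefix."""
    return [(m.group(1), m.group(2).strip()) for m in LABEL_RE.finditer(text)]


def first_match(pattern, text):
    """Return the first capture group of the first match, or None."""
    m = re.search(pattern, text, re.MULTILINE)
    return m.group(1) if m else None


def every_nth_line(text, n, offset=0):
    """Return every n-th line, starting from a zero-based line offset."""
    return text.splitlines()[offset::n]


if __name__ == "__main__":
    print(labelled_fields(SAMPLE))
    print(first_match(r"^Date:[ \t]*(\S+)", SAMPLE))
    print(every_nth_line(SAMPLE, 6))
```

Lines without a known prefix (like the `Page 1 of 1` footer) simply drop out of `labelled_fields`, which is the filtering effect described above. The every-6th-line trick only holds if each record really spans the same number of lines, so it is worth checking on a few files first.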

Another tip, though irritating, is to really try to place the PDF files into groups before you convert / OCR to text. That way, you know all the files in one folder will be in the same format.

It's not the easiest job, you might need to use several methods to get everything out.

Damn man, I know who to pester for advice like this. Superb!

**938** · 04-11-2025, 10:22 AM

I personally automate all this stuff with Python, just ask GPT to do it if you're too stupid to write it yourself

joepa · 04-11-2025, 10:45 AM

(04-08-2025, 11:21 AM)corvenik Wrote: I used this sometimes.
https://github.com/VikParuchuri/pdftext

Good tip, thanks!

Solidsnake23 · 04-11-2025, 10:47 PM

Thank you all for your advice.

I was going to create a Python script with lots of regex to extract the data I needed, then I found this tool: https://github.com/DocumindHQ/documind
It lets you extract data even from PDFs that are images (using OCR), all automated if you're lazy.

Plenty of potential in this one.
