
Solidsnake23 · 04-05-2025, 10:14 PM

Hello Everyone,

First, moderators, apologies if this is not the right place to ask questions. I am new to this forum and still figuring out the best way to use it and collaborate.

I want to learn how to extract data and compile it so I can share/exchange intel

I have currently downloaded a published ransomware database from a car hire company, in this, a lot of full names and addresses, email address and phone numbers are available (3.6k), I want to be able to extract from each PDF the (first name, last name, adress, email, phone number, ...)

My question is, how do you guys do this? Do you have your own Python script? What approach do you use?

I have converted all the PDFs to txt files to make this simpler. I tried to write a Python script but with no success (the data is not on the same line number in each file, and sometimes the address gets confused with the company address in the PDF/txt files).

Although I will continue trying, I would appreciate it if any of you could advise me.

Many thanks

argue · 04-07-2025, 01:12 AM

Just throw it into a SQL db or convert it to JSON so you can parse with jq. It really depends what you are comfortable with.

Solidsnake23 · 04-07-2025, 09:47 AM

(04-07-2025, 01:12 AM)argue Wrote: Just throw it into a SQL db or convert it to JSON so you can parse with jq. It really depends what you are comfortable with.

Argue,

The JSON idea seems very good. I remember reading about it in Michael Bazzell's book but never really gave it all my energy (my mistake). I will definitely try it and let you know.

Thank you for your advice.

BSEA · 04-07-2025, 10:42 AM

If you are comfortable with Python, consider using an NLP library together with regex.

DredgenSun · 04-07-2025, 01:48 PM

If you're on Mac, use Automator, there's an OCR text extractor built in.

WitchBrewXX · 04-07-2025, 03:35 PM

Have you tried converting to an Excel file and extracting the data?

OriginalCrazyOldFart · 04-07-2025, 04:58 PM

(04-07-2025, 01:48 PM)DredgenSun Wrote: If you're on Mac, use Automator, there's an OCR text extractor built in.

I was just thinking the same thing. lol ! My brother is blind & I took him to a place they gave out free computers & the things to go with them, to people who are blind. They gave him the first OCR program I ever saw at that time, that was illegal to send outside the country, or to copy it. (which I did the day he got it. lol) I'd try using it first, I know it works & will work with MicroFiche & Film to convert it to a text file.
It's called OpenBook & has been updated over the years. https://www.freedomscientific.com/produc.../openbook/
You can try these also:
https://www.perkins.org/resource/best-oc...-impaired/

I have a few others called: pdftotext_setup & PdfToText

HassaMassa · 04-08-2025, 10:15 AM

(04-07-2025, 01:48 PM)DredgenSun Wrote: If you're on Mac, use Automator, there's an OCR text extractor built in.

It has the ability to convert PDF to text but not OCR. Also, pretty sure OP just said they already converted to txt.

corvenik · 04-08-2025, 11:21 AM

I used this sometimes.
https://github.com/VikParuchuri/pdftext

HassaMassa · 04-08-2025, 02:18 PM

Seems most people are gonna keep sending you info on how to convert to text, despite you already saying you've done that.

It's not the easiest job to do, depending on what info you're wanting to extract. Like you said, stuff can be put on different lines etc.

You need to find patterns to how the text is presented in the files. Search for common prefixes, put a label on it so you can filter out other data and keep only what you need.

Use Regexes to help where possible. Ones that can target the first instance of a character / string, ones that can target data on every 6th line etc.

Another tip, though irritating, is to really try to place the PDF files into groups before you convert / OCR to text. That way, you know all the files in one folder will be in the same format.

It's not the easiest job, you might need to use several methods to get everything out.
