Posts: 21
Threads: 2
Joined: Apr 2025
Hello Everyone,
First, moderators, apologies if this is not the right place to ask questions. I am new to this forum and still figuring out the best way to use it and collaborate.
I want to learn how to extract data and compile it so I can share/exchange intel.
I have downloaded a published ransomware database from a car hire company. It contains a lot of full names, addresses, email addresses and phone numbers (3.6k). I want to be able to extract from each PDF the first name, last name, address, email, phone number, ...
My question is, how do you guys do this? Do you have your own Python script? What approach do you use?
I have converted all the PDFs to txt files to make this simpler. I tried to write a Python script but with no success (the fields are not on the same line number in each file, and sometimes the address gets confused with the company address in the PDF/txt files).
Although I will keep trying, I would appreciate it if any of you could advise me.
Many thanks
Posts: 69
Threads: 2
Joined: Jul 2024
Just throw it into a SQL db or convert it to JSON so you can parse it with jq. It really depends on what you are comfortable with.
Posts: 21
Threads: 2
Joined: Apr 2025
(04-07-2025, 01:12 AM)argue Wrote: Just throw it into a SQL db or convert it to json so you can parse with jq. It really depends what you are comfortable.
Argue,
The JSON idea seems very good. I remember reading about it in Michael Bazzell's book but never really gave it all my energy (my mistake). I will definitely try it and let you know.
Thank you for your advice.
Posts: 50
Threads: 3
Joined: Mar 2025
If you are comfortable with Python, consider using an NLP library together with RegEx.
Posts: 10,305
Threads: 216
Joined: Jun 2023
If you're on Mac, use Automator; there's an OCR text extractor built in.
Posts: 42
Threads: 5
Joined: Mar 2025
Have you tried converting it to an Excel file and extracting the data from there?
Posts: 759
Threads: 92
Joined: Jan 2024
(04-07-2025, 01:48 PM)DredgenSun Wrote: If you're on Mac, use Automator, theres an OCR text extractor built in
I was just thinking the same thing, lol! My brother is blind, and I took him to a place where they gave out free computers, and the things to go with them, to people who are blind. They gave him the first OCR program I ever saw. At that time it was illegal to send it outside the country or to copy it (which I did the day he got it, lol). I'd try using it first. I know it works, and it will work with microfiche and film to convert them to a text file.
It's called OpenBook and has been updated over the years. https://www.freedomscientific.com/produc.../openbook/
You can try these also:
https://www.perkins.org/resource/best-oc...-impaired/
I have a few others called: pdftotext_setup & PdfToText
Posts: 1,091
Threads: 8
Joined: Jun 2023
(04-07-2025, 01:48 PM)DredgenSun Wrote: If you're on Mac, use Automator, theres an OCR text extractor built in
It has the ability to convert PDF to text but not OCR. Also, pretty sure OP just said they already converted to txt.
Posts: 1,091
Threads: 8
Joined: Jun 2023
Seems most people are gonna keep sending you info on how to convert to text, despite you already saying you've done that.
It's not the easiest job to do, depending on what info you're wanting to extract. Like you said, stuff can be put on different lines etc.
You need to find patterns in how the text is presented in the files. Search for common prefixes and put a label on them so you can filter out other data and keep only what you need.
Use regexes to help where possible: ones that can target the first instance of a character / string, ones that can target data on every 6th line, etc.
Another tip, though irritating, is to really try to place the PDF files into groups before you convert / OCR to text. That way, you know all the files in one folder will be in the same format.
It's not the easiest job; you might need to use several methods to get everything out.