Advice on how to extract info from multiple PDFs
by Solidsnake23 - Saturday April 5, 2025 at 10:14 PM
#1
Hello Everyone,

First, moderators, apologies if this is not the right place to ask questions. I am new to this forum and still figuring out the best way to use it and collaborate.

I want to learn how to extract data and compile it so I can share/exchange intel

I have currently downloaded a published ransomware database from a car hire company, in this, a lot of full names and addresses, email address and phone numbers are available (3.6k), I want to be able to extract from each PDF the (first name, last name, adress, email, phone number, ...)

My question is, how do you guys do this? do you have your own python script?? what approach do you use

I have converted all pdfs in txt file to make this simpler, I tried to create a python script but with no success (not the same line number and sometimes address gets confused with the company address on the pdf/txt files)

Although I will continue trying, I would appreciate it if any of you could advise me.

Many thanks
Reply
#2
Just throw it into a SQL db or convert it to JSON so you can parse it with jq. It really depends on what you are comfortable with.
Reply
#3
(04-07-2025, 01:12 AM)argue Wrote: Just throw it into a SQL db or convert it to JSON so you can parse it with jq. It really depends on what you are comfortable with.

Argue,

The JSON idea seems very good. I remember reading about it in Michael Bazzell's book but never really gave it all my energy (my mistake). I will definitely try it and let you know.

Thank you for your advice.
Reply
#4
If you are comfortable with Python, consider using an NLP library together with RegEx.
Reply
#5
If you're on Mac, use Automator, there's an OCR text extractor built in.
"Universal appeal is poison masquerading as medicine. Horror is not meant to be universal. It's meant to be personal, private, animal"
Reply
#6
Have you tried converting to an Excel file and extracting the data?
Reply
#7
(04-07-2025, 01:48 PM)DredgenSun Wrote: If you're on Mac, use Automator, there's an OCR text extractor built in

I was just thinking the same thing, lol! My brother is blind, and I took him to a place that gave out free computers and accessories to people who are blind. They gave him the first OCR program I ever saw. At that time it was illegal to send it outside the country or to copy it (which I did the day he got it, lol). I'd try using it first. I know it works, and it will work with microfiche and film to convert them to a text file.
It's called OpenBook and has been updated over the years. https://www.freedomscientific.com/produc.../openbook/
You can try these also:
https://www.perkins.org/resource/best-oc...-impaired/

I have a few others called: pdftotext_setup & PdfToText
Reply
#8
(04-07-2025, 01:48 PM)DredgenSun Wrote: If you're on Mac, use Automator, there's an OCR text extractor built in

It has the ability to convert PDF to text but not OCR. Also, pretty sure OP just said they already converted to txt.
Reply
#9
I used this sometimes.
https://github.com/VikParuchuri/pdftext
Reply
#10
Seems most people are gonna keep sending you info on how to convert to text, despite you already saying you've done that.

It's not the easiest job to do, depending on what info you're wanting to extract. Like you said, stuff can be put on different lines etc.

You need to find patterns to how the text is presented in the files. Search for common prefixes, put a label on it so you can filter out other data and keep only what you need.

Use Regexes to help where possible. Ones that can target the first instance of a character / string, ones that can target data on every 6th line etc.

Another tip, though irritating, is to really try to place the PDF files into groups before you convert / OCR to text. That way, you know all the files in one folder will be in the same format.

It's not the easiest job, you might need to use several methods to get everything out.
Reply

