04-09-2025, 09:31 AM
(04-08-2025, 02:18 PM)HassaMassa Wrote: Seems most people are gonna keep sending you info on how to convert to text, despite you already saying you've done that.
It's not the easiest job to do, depending on what info you're wanting to extract. Like you said, stuff can be put on different lines etc.
You need to find patterns in how the text is presented in the files. Search for common prefixes and label them, so you can filter out the other data and keep only what you need.
Use regexes to help where possible: ones that target the first instance of a character / string, ones that target data on every 6th line, etc.
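A rough sketch of those three ideas (prefix labels, first instance, every 6th line) in Python. The sample text, the field names and the function names are all made up for illustration, so adjust them to whatever your converted files actually look like. The every-Nth-line part uses plain list slicing rather than a regex, since that's simpler when the layout is strictly regular.

```python
import re

# Hypothetical sample of text extracted from a PDF; the fields are invented.
SAMPLE = """\
Invoice No: 1001
Date: 04-08-2025
Customer: Alice
Total: 19.99
Notes: none
----
Invoice No: 1002
Date: 04-09-2025
Customer: Bob
Total: 5.00
Notes: rush
----
"""

# Prefixes we want to keep; every other line is filtered out.
WANTED = ("Invoice No", "Total")

# "label: value" lines. Lines without a colon (e.g. "----") do not match.
LABELLED = re.compile(r"^(?P<label>[^:\n]+):\s*(?P<value>.*)$")


def labelled_fields(text, wanted=WANTED):
    """Return (label, value) pairs for lines whose prefix is in `wanted`."""
    pairs = []
    for line in text.splitlines():
        m = LABELLED.match(line)
        if m and m.group("label") in wanted:
            pairs.append((m.group("label"), m.group("value")))
    return pairs


def first_after(text, marker):
    """Return the rest of the line after the FIRST occurrence of `marker`."""
    m = re.search(re.escape(marker) + r"\s*(.*)", text)
    return m.group(1) if m else None


def every_nth_line(text, n, offset=0):
    """Return every n-th line, starting at zero-based line index `offset`."""
    return text.splitlines()[offset::n]


if __name__ == "__main__":
    print(labelled_fields(SAMPLE))
    print(first_after(SAMPLE, "Customer:"))
    print(every_nth_line(SAMPLE, 6))
```

If the records are not always exactly the same number of lines, the slicing trick breaks, and matching on the prefix is the safer option.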
Another tip, though irritating, is to really try to place the PDF files into groups before you convert / OCR to text. That way, you know all the files in one folder will be in the same format.
It's not the easiest job, and you might need to use several methods to get everything out.
Damn man, I know who to pester for advice like this. Superb!
"Universal appeal is poison masquerading as medicine. Horror is not meant to be universal. It's meant to be personal, private, animal"