Text from PDF (poor OCR)



grimm182
June 24th, 2018, 23:50
Anyone having any luck getting text out of older modules in pdf format?
No matter what I try the text is so poorly formatted it would be quicker to type it all from scratch.
Tried various OCR options in Acrobat XI and the best I can do is 85% accurate, but with a buttload of double spaces!!

[MODERATOR - removed request for copyright data. Please don't request copyright protected data on these forums - even if you own the original product. Thanks.]

Zacchaeus
June 25th, 2018, 00:23
Even the best-quality PDFs will have multiple double spaces. Copy into something like Notepad++ first and clean up the text using search/replace before pasting into FG.
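
For anyone who would rather script that search/replace step, here is a minimal Python sketch of the same cleanup. The function name `collapse_spaces` is made up for illustration, and it assumes the only problem is runs of ordinary spaces.

```python
import re

def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    # Only plain spaces are touched; newlines and tabs are left alone.
    collapsed = re.sub(r" {2,}", " ", text)
    # Drop any spaces left dangling at the end of each line.
    return re.sub(r" +$", "", collapsed, flags=re.MULTILINE)

sample = "The  goblin   attacks  the party.  \nRoll  initiative."
print(collapse_spaces(sample))
```

The same ` {2,}` pattern should also work in the Notepad++ Replace dialog with Search Mode set to "Regular expression" and a single space as the replacement.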

celestian
June 25th, 2018, 00:39
Even the best-quality PDFs will have multiple double spaces. Copy into something like Notepad++ first and clean up the text using search/replace before pasting into FG.

I've actually found that fixing it in bulk from the db.xml file is a better method for me. Just search and replace double spaces in the xml file. I hate having to copy from the PDF then drop it into notepad then copy/paste that into FG. This way I just copy/paste it all into FG and then fix the db.xml when I'm completely done.
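
A rough sketch of that bulk fix in Python, for those who prefer a script over an editor. The function names are hypothetical, and the approach rests on two assumptions: that leading indentation in db.xml should be left untouched, and that it is safe to edit the file while the campaign is closed. It works on raw bytes so it does not have to guess the file's text encoding, and it writes a `.bak` copy first.

```python
import re
import shutil
from pathlib import Path

# Runs of 2+ spaces sitting between two non-space characters.
# Indentation at the start of a line is skipped because it is
# preceded by a newline or another space, never by a non-space.
INNER_SPACES = re.compile(rb"(?<=\S) {2,}(?=\S)")

def fix_double_spaces(xml_bytes: bytes) -> bytes:
    """Collapse double spaces inside the content, leaving indentation alone."""
    return INNER_SPACES.sub(b" ", xml_bytes)

def fix_file(path: Path) -> None:
    """Back up the file as <name>.bak, then rewrite it with spaces collapsed."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)
    path.write_bytes(fix_double_spaces(path.read_bytes()))

# Example call (the path is a placeholder for your campaign folder):
# fix_file(Path("db.xml"))
```

Keeping the backup means a bad replace can be undone by copying `db.xml.bak` back over `db.xml`.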

Talyn
June 25th, 2018, 01:19
I'm with Zacc. I always export PDF to plain text (or accessible if the plain text is botched). If it comes down to copying direct from the PDF, something has gone horribly awry and that's my second-to-last option before just re-typing it myself from scratch. I have 3 or 4 search/replaces set up in Notepad++ (I could probably turn them into a macro if I could be bothered to) that clean up 99% of the textual issues (assuming the exported text was in a usable state). But I'm always working with "newer" PDFs, not old scanned "pictures turned into a PDF", which are just a hot mess to begin with, never mind getting the text out of them.
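
The "handful of search/replaces as a macro" idea could look something like the Python sketch below. The four rules are illustrative guesses at common cleanup steps, not the actual Notepad++ replaces described above, and the names `RULES` and `clean` are invented for the example.

```python
import re

# Each rule is (pattern, replacement); they run top to bottom.
# These four are illustrative guesses, not anyone's actual list.
RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),  # rejoin words hyphenated across a line break
    (re.compile(r"[ \t]+\n"), "\n"),        # strip trailing whitespace before a newline
    (re.compile(r" {2,}"), " "),            # collapse runs of spaces
    (re.compile(r"\n{3,}"), "\n\n"),        # keep at most one blank line between paragraphs
]

def clean(text: str) -> str:
    """Apply every rule in order and return the cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

raw = "The adven-\nturers  enter.   \n\n\n\nNext  room."
print(clean(raw))
```

One caveat with the first rule: it also joins genuine compounds that happen to break at a line end ("well-" / "known" becomes "wellknown"), so the output still wants a quick read-through.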

grimm182
June 25th, 2018, 01:43
Since this is something that A LOT of people do for Fantasy Grounds, think there is a way to automate this somehow?
Some sort of plug-in or extension?
Or anyone want to do a step by step instruction?

Good to hear feedback on how others do it, so thanks for everyone chiming in!

Talyn
June 25th, 2018, 01:54
Since this is something that A LOT of people do for Fantasy Grounds, think there is a way to automate this somehow?

Could you elaborate on what part of the discussion qualifies as "this?"

Both Zacc and I use Notepad++ to work on the raw text files. I then work in raw XML to build my DLC whereas Zacc works in a separate plain text file to create the markdown the PAR5E (not available to the public) tool uses to create its XML output.

I'm not sure if the free Acrobat Reader has the File -> Save As... text export; I use Acrobat Pro but there are plenty of third-party PDF readers and editors that will do a plain text export.

grimm182
June 25th, 2018, 02:11
To clarify "this", I meant that FG users possibly quite often take text from PDFs and convert it over... It would be nice if there was a community tool for this.

That being said, I am trying the above methods out and will see how things pan out... so thanks again.

LordEntrails
June 25th, 2018, 02:32
The problem with converting PDFs has nothing to do with FG. It has everything to do with the quality of the original file (which is in PDF). Any tool (such as Par5E, NPC Engineer, Spell Engineer, etc.) is going to have the same issue: "garbage in, garbage out."

You have to get the content cleaned up after extracting it from the PDF. That has nothing to do with FG, and everything to do with OCR. You can look at tools like these (https://www.google.com/search?q=ocr+cleanup+software&rlz=1C1CHBD_enUS784US785&oq=ocr+cleanup&aqs=chrome.2.69i57j0l5.4139j0j7&sourceid=chrome&ie=UTF-8), I've never tried them.

JohnD
June 25th, 2018, 02:50
You are at the mercy of whether or not the person who put the PDF together did a good job.

Talyn
June 25th, 2018, 02:55
There are tools such as the Author extension available on the forums (I have not yet tried it), or the Savage Worlds Extended Library (SWEL) which I have used, but mostly I just stick with Notepad++ because I have absolute control over the quality of the output that way.

However, as @LordEntrails noted, regardless of how we accomplish things, we require quality source material to start with. Some PDFs are built in downright amateurish and irresponsible manners, and the worst are the cases of someone scanning or taking photos of pages then releasing those as a PDF. Those are usually pirate copies, though occasionally you'll get official reprints of products so old they're out of print and the dubious scan quality was the best the publisher could do in a bad situation.

Lexfire
June 26th, 2018, 01:53
I have been using Google Drive and their OCR software to convert Dungeon magazine adventures from image pdfs. The software converts text best when presented in separate column images.

1. I select an adventure, and use the Windows Snipping Tool to create an image for each column from each page for the entire adventure.
2. An adventure on page x with 3 columns would result in 3 images called x.1, x.2, and x.3.
3. All images are uploaded to a Google Drive folder.
4. Right click on the image and select “Open with – Google Docs”.
5. The software converts the word images to text, and the resulting text from the original image is below.

Here is the Original image from Dungeon Magazine 41 page 12.

https://i.imgur.com/cXB049q.jpg

Here is an image of the OCR text output.

https://i.imgur.com/ldlbVrn.jpg


Took me about an hour to window snip, post, convert, and create a single Dungeon adventure text document with an 11,000+ word count for a 2nd edition adventure. Then I pasted that text into my FG template.
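
The x.1, x.2, x.3 naming from the steps above also makes the final "create a single text document" step easy to script. The Python sketch below is one way to do it. It assumes the OCR text for each column has been saved under the same page.column name with a file extension (for example `12.1.txt`); that export step and the function names are hypothetical.

```python
from pathlib import Path

def column_key(filename: str) -> tuple[int, int]:
    """Turn a name like '12.3.txt' into (12, 3) so sorting is numeric."""
    # Path.stem drops only the final extension: '12.3.txt' -> '12.3'.
    page, column = Path(filename).stem.split(".")
    return int(page), int(column)

def merge_columns(texts: dict[str, str]) -> str:
    """Join per-column OCR text in page order, then column order."""
    ordered = sorted(texts, key=column_key)
    return "\n\n".join(texts[name].strip() for name in ordered)

# Made-up column snippets, keyed by the name of the snipped image's text file.
columns = {
    "10.1.txt": "Third column read.",
    "9.2.txt": "Second column read.",
    "9.1.txt": "First column read.",
}
print(merge_columns(columns))
```

Sorting on the numeric pair matters because a plain alphabetical sort would put page 10 ahead of page 9.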

grimm182
September 6th, 2018, 12:55
I just stumbled across the damnedest thing (no offense, Damned, haha), so... converting text from Isle of Dread (Blue).
If I select the text as normal, I run into the same old issue I always do with older PDFs: bad recognition and horrible spacing.
BUT, if I go into edit text mode (Acrobat), I can copy and paste text flawlessly....

The problem with this is that every paragraph is its own text box, though oddly it's quicker to copy and paste this way and never have to edit or fix formatting.
I know extra formatting is partially to blame... but why can't I select whole blocks of text with the same accuracy as the text edit tool!!!!
Any ideas?

Octavious
September 6th, 2018, 16:10
As long as you can copy the PDF text and paste it into FG as text, but the formatting is bad (extra spaces, etc.), just select the text and press Ctrl-J and FG will format it for you.

grimm182
September 6th, 2018, 16:24
I painfully know the Ctrl-J fix. To put it another way, if the text box was the whole page vs. text boxes of individual paragraphs, then I could input an entire page in 2 steps (copy + paste). :)
The way I have done it in the past was to Ctrl-J every paragraph within FG... again, this is an issue with older PDFs for me.
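
One way to get closer to "a whole page in two steps" is to reflow the copied page before pasting, so every paragraph arrives as a single clean line. The Python sketch below does that. It assumes paragraphs in the copied text are separated by blank lines, and it is only a guess at the kind of cleanup Ctrl-J performs, not a description of what FG does internally. The name `reflow` is made up.

```python
import re

def reflow(page_text: str) -> str:
    """Join hard-wrapped lines so each paragraph becomes one line."""
    # A blank line (possibly containing stray spaces) ends a paragraph.
    paragraphs = re.split(r"\n\s*\n", page_text.strip())
    # str.split() with no arguments splits on any whitespace run,
    # so this removes line breaks and double spaces in one pass.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

page = "The  cave is\ndark and  damp.\n\nA troll\nwaits  inside.\n"
print(reflow(page))
```

If the PDF copy comes through with no blank lines between paragraphs, this will merge the whole page into one paragraph, so it suits sources where the paragraph breaks survive the copy.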

LordEntrails
September 6th, 2018, 16:57
It depends on how the PDF was created. In your specific case, I believe the reason edit mode doesn't behave the way you want is that the file was originally authored in a tool that had separate text boxes. Therefore they are separate boxes when you are in edit mode. At least that is what I suspect. And since general select mode is doing more of an on-the-fly OCR, they behave differently.

Mirloc
September 11th, 2018, 14:57
Not all PDFs are created equal. Think of it less like a .txt file, where the text is written and perfect because the standard definition of the letter "T" has not changed since the first printed page. Raw text literally stores a code for each character, the exact same system used to transmit the block of information in its rawest form. But the pretty text, particularly drop-caps, italics, and worse yet handwriting fonts, can cause the OCR to incorrectly identify the words. Plus, the formatting of paragraphs can include extra spaces within the line so that the line fills out the column.

PDFs come in several flavors: Images, converted text and raw text. The differences are VAST.
- Raw text PDFs are your best bet; these are 'printed' to PDF or saved as a PDF from some other piece of software. They include the raw text, along with the font mapping and image files necessary to display the text as intended. These files are typically larger (the fonts are embedded into the PDF file along with the text and pictures). You can copy/paste these to your heart's content. A block of text is copied as a string of characters and requires little to no touch-up work to put it into FG.
- Converted text PDFs are converted from printed source material: the parts are identified and placed into the document, but they aren't perfect, and misspellings or unintentional letter loss often happen. These are at the mercy of OCR (which has gotten loads better since the late 90s but is still far from perfect), and the touch-ups that need to happen rarely do, as the text is perfectly readable. The easiest ways to identify these: drop-cap art is never identified as text but is seen as a small image, and blocks of text are copied as multiple lines of text rather than a single string, so each paragraph has to be touched to clean up the text string.
- The final type is simply snapshots taken of each page and converted to images, which are then tied together into a PDF. These are REALLY easy to identify: they are 10-20 times the size of a properly converted PDF, and when you click on a page, you select the entire page as a graphic.