Scanning PDFs to extract text (OCR)



grimm182
November 23rd, 2009, 20:49
I have had mixed results using Acrobat 7 to export text from a PDF adventure/module; too often the output is as much of a pain to correct as the text would be to type manually.

I have several modules I would like to convert to FG, but so far this seems to be the biggest time-sink for me. Does anyone know of a better program to use? I have even thought that maybe a speech-to-text program might work.

unerwünscht
November 23rd, 2009, 20:59
Have you considered the possibility of paying some kid in China $2 to do it?

ddavison
November 24th, 2009, 17:29
Why did "all your base are belong to us" suddenly come to mind? ;)

I have had similar results with text. I have found that xPDF occasionally works well with text and generally works very well with extracting images. The nice thing about the image extraction is that it pulls each image separately even if it was layered onto the page within the PDF. The background image will appear as one image and the portrait of the NPC would pull out as a 2nd image, for instance.
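For anyone who wants to script this, here is a minimal Python sketch. It assumes the xpdf (or Poppler) command-line tools `pdftotext` and `pdfimages` are installed and on the PATH; the helper names and the file names in the usage note are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf, txt, keep_layout=True):
    """Build the pdftotext command line; -layout tries to keep the page layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [str(pdf), str(txt)]


def pdfimages_cmd(pdf, image_root):
    """Build the pdfimages command line; -j saves JPEG-encoded images as .jpg."""
    return ["pdfimages", "-j", str(pdf), str(image_root)]


def extract(pdf, out_dir):
    """Run both tools, writing the text file and the images into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf).stem
    # check=True raises CalledProcessError if either tool reports a failure.
    subprocess.run(pdftotext_cmd(pdf, out / f"{stem}.txt"), check=True)
    subprocess.run(pdfimages_cmd(pdf, out / stem), check=True)
```

Calling `extract("module.pdf", "extracted")` would then leave one text file plus one numbered image file per embedded image in the `extracted` folder. Whether the text comes out usable still depends on how the PDF was produced.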

grimm182
November 24th, 2009, 18:30
Cool, I just noticed today that IrfanView has an option to do OCR... so I will try your suggestion as well. Otherwise I may have to resort to a speech program.
Thanks

Foen
November 24th, 2009, 21:59
I have really struggled with this, even using PDFs with embedded text. Having converted Call of Cthulhu, Rolemaster Classic and (in train) Basic Roleplaying, it is obvious there is no easy answer.

I now use a combination of three methods:

1. Using Acrobat, copy and paste the embedded text to a text file, manually convert to html, and manually convert to an FG module. On the plus side, it beats OCR for accuracy; on the minus side, it loses italics/bold/formatting/table layout and isn't available for older/non-embedded text PDFs. BTW, embedded text can sometimes differ from the displayed (rendered) text. This approach also lends itself to parser-type technology for weapons, spells/powers, creatures etc.

2. Using Abbyy FineReader to OCR the PDF and create html output. On the plus side, it deals quite well with table layout, converts italics/bold well, and deals with older (image) PDFs. On the minus side, it actually turns neat vector fonts into BMPs, then tries to OCR them, so you have to correct the character recognition manually. Yuck!

3. Re-typing and recreating the source data. On the plus side, it can save time and frustration compared with the other methods; on the minus side, you are guaranteed to have human transcription errors, and it may not save any time or frustration.
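The "text file to html" step in method 1 can be partly automated. The following is only a rough Python sketch under stated assumptions: blank lines are taken as paragraph breaks, single line breaks as soft wraps, and the function name is invented for the example. It does nothing to recover italics, bold or tables, which are lost at the copy and paste stage anyway.

```python
import html
import re


def text_to_html(raw):
    """Convert pasted plain text into simple HTML paragraphs.

    Blank lines separate paragraphs; line breaks inside a paragraph are
    treated as soft wraps and replaced with a single space.
    """
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    out = []
    for para in paragraphs:
        # Rejoin words hyphenated across a line break ("gob-\nlin").
        # Caveat: this also joins genuinely hyphenated words split at a wrap.
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Collapse remaining line breaks and runs of whitespace.
        para = " ".join(para.split())
        if para:
            # Escape &, < and > so the text is safe inside HTML.
            out.append(f"<p>{html.escape(para)}</p>")
    return "\n".join(out)
```

The output still needs a manual pass for headings and stat blocks, but it removes the most mechanical part of the clean-up.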

Having spent 3-4 years on product conversion, it is sobering to think there is no easier solution. I use a mixture because each has merits and flaws and none work well for all circumstances.

Just my 2c, and I'd be delighted for someone to show me an easier path.

Stuart

Thore_Ironrock
November 25th, 2009, 00:03
The reason for the poor results is the way the product was laid out and then converted to PDF. Programs like PageMaker and Quark are the worst, while InDesign gives slightly better results.

Tenian
November 25th, 2009, 11:44
For documents that were designed to be PDFs (i.e. not just a collection of scanned images), Zeph has had a lot of success using a professional version of Acrobat to export to XML and process that.

However, that's not a particularly cheap solution and you still have to do some manipulation of the output (not to mention converting it into a module).

Foen
November 25th, 2009, 15:01
That's interesting - I have Acrobat and will need to look into it. If it saves any time it will be worth doing!

grimm182
November 25th, 2009, 15:05
Foen, what version of Acrobat will you use? I am curious whether a version higher than 7.0 makes any improvement.

Maybe I can send you an output done with 7 to compare, if you have a higher version... maybe a single-page example.

Zeus
November 25th, 2009, 15:40
As Tenian pointed out, for those interested there is another way ...

XML and XSL transformations.

First, you need to convert the PDF to an XML format. As with conversion to HTML, there are various methods for doing this; however, I found the XML output option in Adobe's Acrobat 9 Professional Extended version to be the most consistent.

The quality and correctness of this conversion is heavily dependent on the Tag status of the PDF. Tags are used like markup to define headers, text, figures, tables, etc., so that PDF readers with Accessibility options can re-flow the text and adjust images, tables, etc. to suit the reader's interface.

Adobe provides a standard set of Tags which, if properly used, allow for easy conversion of the PDF to another format like XML or HTML.
However, not all authoring tools comply with the standard set or make proper use of the tag structure, so it's possible to convert but you lose the metadata around the text, e.g. header, bold, and italic typefaces. Foen - I suspect this is why you lose the markup when you convert to HTML.

The good news is that if the PDF and Tag structure is sound, it's trivial to convert to XML (use the Save As or Export to options in Acrobat 9 Pro E). All that's left is to prepare an XSL stylesheet to convert the exported XML version to the target XML schema, in this case perhaps the 4E_JPG module schema.
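To give a feel for what such a transformation does, here is a small Python sketch that uses the standard library's ElementTree in place of an XSL stylesheet. Everything about the data is an assumption for illustration: the input is taken to use the standard PDF structure tags (`Document`, `Sect`, `H1`, `P`, `Span`), which may not match what Acrobat actually exports for a given file, and the output element names (`module`, `entry`, `title`, `text`) are hypothetical placeholders, not the real 4E_JPG schema.

```python
import xml.etree.ElementTree as ET


def tagged_xml_to_entries(source_xml):
    """Group each H1 heading with the P paragraphs that follow it.

    Input tag names (H1, P) are assumed; output names (module, entry,
    title, text) are hypothetical placeholders for a target schema.
    Paragraphs that appear before the first heading are dropped.
    """
    src = ET.fromstring(source_xml)
    module = ET.Element("module")
    entry = None
    for node in src.iter():  # document order, so headings precede their text
        # Flatten inline children and normalise whitespace.
        content = " ".join("".join(node.itertext()).split())
        if node.tag == "H1":
            entry = ET.SubElement(module, "entry")
            ET.SubElement(entry, "title").text = content
        elif node.tag == "P" and entry is not None:
            ET.SubElement(entry, "text").text = content
    return module


# Invented sample standing in for an exported, well-tagged PDF.
sample = """<Document>
  <Sect><H1>Encounter 1: The Gate</H1>
    <P>Two guards watch the <Span>north</Span> gate.</P>
    <P>They are bored.</P>
  </Sect>
  <Sect><H1>Encounter 2: The Vault</H1><P>The vault is locked.</P></Sect>
</Document>"""

result = tagged_xml_to_entries(sample)
print(ET.tostring(result, encoding="unicode"))
```

A real XSL stylesheet expresses the same idea declaratively with templates that match on tag names, which is why consistent tagging in the source PDF matters so much.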

I have already written (and posted) several XSLs which cope quite well with the Scales of War Adventure Path material as published in WotC Dungeon magazine, other Dragon magazine content, and the RPGA modules. The output from these XSLs can be fed directly (with minor editing) into Tenian's 4EParser tool. So it's possible to produce SoW/RPGA modules in a much shorter time frame than with the manual approach.

The XSLs can be found at: https://zgp.eugenez.net/XSL/

I also have another XSL which simply dumps the text from a WotC rulebook which has been tagged/exported to XML. Unfortunately, the authoring tools that WotC now use to produce the original source documents don't do a great job of using a consistent set of tags. The results are therefore basic and require heavy editing.
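A plain text dump of this kind does not need to know the tag names at all. As a rough illustration in Python (not the posted XSL, and with an invented function name and invented sample tags), the sketch below emits one line per text-bearing element and ignores what the element is called.

```python
import xml.etree.ElementTree as ET


def has_own_text(node):
    """True if the element holds text directly, not only inside children."""
    if (node.text or "").strip():
        return True
    return any((child.tail or "").strip() for child in node)


def dump_text(source_xml):
    """Return the document text, one line per text-bearing element."""
    root = ET.fromstring(source_xml)
    lines = []

    def walk(node):
        if has_own_text(node):
            # Emit the element whole, inline children included.
            lines.append(" ".join("".join(node.itertext()).split()))
        else:
            # Pure container: descend until text is found.
            for child in node:
                walk(child)

    walk(root)
    return "\n".join(lines)
```

Because all structure is thrown away, headings, stat blocks and body text come out as undifferentiated lines, which matches the "basic, needs heavy editing" result described above.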

Still it makes much shorter work than the laborious cut n paste approach.

As I exclusively play 4E D&D I have not tried this approach with other Game PDFs but it should work subject to a correctly defined XSL stylesheet.

PDFs that are not already tagged will require Tags to be added either automatically (the quality of this depends on the original authoring software used to produce the PDF) or manually. In the manual case the effort outweighs the benefit, as you will be traversing the PDF by hand, selecting text and assigning it its tag. At that point the exercise becomes futile, as you may as well cut n paste manually as per the traditional or current approaches.
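Before spending time on a given PDF, it can help to check whether it claims to be tagged at all. A tagged PDF carries a structure tree (`/StructTreeRoot`) and a `/MarkInfo` dictionary with `/Marked true` in its catalog. The Python sketch below is only a crude heuristic that searches the raw bytes for those markers; the function name is invented, and it can miss tagged files whose catalog sits in a compressed object stream.

```python
import re


def looks_tagged(pdf_bytes):
    """Heuristic: does the raw PDF mention a structure tree and /Marked true?

    May give false negatives when the catalog is stored in a compressed
    object stream (possible in PDF 1.5 and later). A positive result says
    nothing about how good or consistent the tagging is.
    """
    has_tree = b"/StructTreeRoot" in pdf_bytes
    marked = re.search(rb"/Marked\s+true", pdf_bytes) is not None
    return has_tree and marked
```

Typical use would be `looks_tagged(open("module.pdf", "rb").read())`, with the file name standing in for whatever you are converting. Acrobat's own document properties dialog reports the same "Tagged PDF" status more reliably.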

Still for those PDFs that are tagged properly, XML and XSL transformation is another approach that can save a hell of a lot of time.

Original thread: https://www.fantasygrounds.com/forums/showthread.php?t=10783&highlight=scales

grimm182
November 25th, 2009, 15:50
Excellent feedback. I'm thinking a "best practice" method for this would boost the number of module remakes, which is win-win for all of us... more time GM'ing, less time typing.

Zeus
November 25th, 2009, 17:40
Another solution for 3.5 D&D is for someone to write an XSL that converts RPGExplorer's dataset files (XML) into a compatible FGII d20/3.5e/d20_JPG module.

All of the hard work of transcribing the plethora of 3/3.5e PDFs into RPGExplorer datasets has already been done. It seems like a good option for rapidly producing 3.5e modules.

See: https://www.dndjunkie.com/rpgxdatasets.aspx

ddavison
November 25th, 2009, 19:01
I imagine that a lot of publishers would already own the full version of Adobe Acrobat. There is a free 30-day trial of Adobe Acrobat 9 Pro Extended here: https://www.adobe.com/products/acrobatproextended/tryout.html

If it works really well, maybe I'll just buy it and then provide that quick service for people wanting to complete conversions. It could significantly reduce the time to market for each of these conversions.

grimm182
November 25th, 2009, 21:21
That sounds very promising. Otherwise, I wonder how many I could convert in 30 days! Hehe, probably more than I'm willing to run.