  1. #1

    Text from PDF (poor OCR)

    Anyone having any luck getting text out of older modules in pdf format?
    No matter what I try the text is so poorly formatted it would be quicker to type it all from scratch.
    Tried various OCR options in Acrobat XI, and the best I can do is 85% accurate, but with a buttload of double spaces!!

    [MODERATOR - removed request for copyright data. Please don't request copyright protected data on these forums - even if you own the original product. Thanks.]
    Last edited by Trenloe; June 25th, 2018 at 00:31.
    ~Grimm182~ (GMT-8)/WA
    GM: Booked
    Player: Available for Sunday Nights

  2. #2
    Zacchaeus's Avatar
    Join Date
    Dec 2014
    Even the best quality PDFs will have multiple double spaces. Copy into something like Notepad++ first and clean up the text using search/replace before pasting into FG.
    If there is something that you would like to see in Fantasy Grounds that isn't currently part of the software or if there is something you think would improve a ruleset then add your idea here https://www.fantasygrounds.com/featu...rerequests.php
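
For anyone who would rather script the search/replace step than do it by hand in a text editor, here is a minimal Python sketch. It only assumes the text has already been copied out of the PDF into a string; the function name `collapse_spaces` is made up for illustration and is not part of FG or Notepad++.

```python
import re

def collapse_spaces(text: str) -> str:
    """Collapse runs of two or more spaces/tabs and trim each line."""
    cleaned_lines = []
    for line in text.splitlines():
        # Replace any run of 2+ spaces or tabs with a single space.
        line = re.sub(r"[ \t]{2,}", " ", line)
        # Drop stray whitespace at the start and end of the line.
        cleaned_lines.append(line.strip())
    return "\n".join(cleaned_lines)

sample = "The  goblin   attacks  \nat   dawn."
print(collapse_spaces(sample))
# The goblin attacks
# at dawn.
```

This does roughly what a regex search for two or more spaces, replaced with one space, would do in an editor.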

  3. #3
    Quote Originally Posted by Zacchaeus
    Even the best quality PDFs will have multiple double spaces. Copy into something like Notepad++ first and clean up the text using search/replace before pasting into FG.
    I've actually found that fixing it in bulk from the db.xml file is a better method for me. Just search and replace double spaces in the xml file. I hate having to copy from the PDF then drop it into notepad then copy/paste that into FG. This way I just copy/paste it all into FG and then fix the db.xml when I'm completely done.
    Fantasy Grounds AD&D Reference Bundle, AD&D Adventure Bundle 1, AD&D Adventure Bundle 2
    Documentation for AD&D 2E ruleset FGU Reference Module, or Web.
    Custom Maps (I2, S4, T1-4, Barrowmaze,Lost City of Barakus)
    Note: Please do not message me directly on this site, post in the forums or ping me in FG's discord.
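
The bulk fix on db.xml described above could also be scripted. The following is a rough sketch, not the poster's actual workflow: the function name `fix_double_spaces` is hypothetical, the file is assumed to be in an ASCII-compatible encoding, and a `.bak` copy is written first because the replacement runs over the whole file, markup included.

```python
import re
import shutil
from pathlib import Path

def fix_double_spaces(xml_path: str) -> int:
    """Collapse runs of 2+ spaces in the file; return how many runs were fixed."""
    path = Path(xml_path)
    # Keep a backup next to the original before touching it.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    data = path.read_bytes()
    # Working on bytes avoids guessing the file's text encoding.
    # The lookbehind only matches runs that follow a non-space character,
    # so leading indentation on each line is left alone.
    fixed, count = re.subn(rb"(?<=\S) {2,}", b" ", data)
    path.write_bytes(fixed)
    return count
```

Called with the path to a campaign's db.xml while FG is closed, it returns the number of runs it collapsed. Check the result against the backup before relying on it.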

  4. #4

    Join Date
    May 2016
    Jacksonville, FL
    I'm with Zacc. I always export the PDF to plain text (or accessible text if the plain text is botched). If it comes down to copying directly from the PDF, something has gone horribly awry, and that's my second-to-last option before just re-typing it myself from scratch. I have 3 or 4 search/replaces set up in Notepad++ (I could probably turn them into a macro if I could be bothered to) that clean up 99% of the textual issues (assuming the exported text was in a usable state). But I'm always working with "newer" PDFs, not old scanned "pictures turned into a PDF", which are just a hot mess to begin with, never mind getting the text out of them.
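
The "3 or 4 search/replaces turned into a macro" idea can be sketched as an ordered list of regex rules applied one after another. The post does not say which replacements are actually used, so the rules below are illustrative guesses at common PDF-export problems, and the names `CLEANUP_RULES` and `run_cleanup` are made up for this example.

```python
import re

# Illustrative rules only; the post does not list the real replacements.
# Order matters: line breaks are repaired before spaces are collapsed.
CLEANUP_RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),   # rejoin words hyphenated across a line break
    (re.compile(r"(?<!\n)\n(?!\n)"), " "),   # join single line breaks inside a paragraph
    (re.compile(r"[ \t]{2,}"), " "),         # collapse runs of spaces or tabs
    (re.compile(r" +\n"), "\n"),             # drop trailing spaces before a line break
]

def run_cleanup(text: str) -> str:
    """Apply every rule in order, like replaying a recorded macro."""
    for pattern, replacement in CLEANUP_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

raw = "The gob-\nlin  attacks\nat   dawn.\n\nRoll  initiative."
print(run_cleanup(raw))
# The goblin attacks at dawn.
#
# Roll initiative.
```

One caveat with the first rule: it also joins genuinely hyphenated compounds that happen to break across lines, so the output still needs a read-through.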

  5. #5
    Since this is something that A LOT of people do for Fantasy Grounds, do you think there is a way to automate this somehow?
    Some sort of plug-in or extension?
    Or does anyone want to write up step-by-step instructions?

    Good to hear feedback on how others do it, so thanks to everyone for chiming in!

  6. #6

    Join Date
    May 2016
    Jacksonville, FL
    Quote Originally Posted by grimm182
    Since this is something that A LOT of people do for Fantasy Grounds, do you think there is a way to automate this somehow?
    Could you elaborate on what part of the discussion qualifies as "this"?

    Both Zacc and I use Notepad++ to work on the raw text files. I then work in raw XML to build my DLC whereas Zacc works in a separate plain text file to create the markdown the PAR5E (not available to the public) tool uses to create its XML output.

    I'm not sure if the free Acrobat Reader has the File -> Save As... text export; I use Acrobat Pro but there are plenty of third-party PDF readers and editors that will do a plain text export.

  7. #7
    To clarify "this": I meant that FG users probably quite often take text from PDFs and convert it over. It would be nice if there was a community tool for this.

    That being said, I am trying the above methods out and will see how things pan out, so thanks again.

  8. #8
    LordEntrails's Avatar
    Join Date
    May 2015
    -7 UTC
    The problem with converting PDFs has nothing to do with FG. It has everything to do with the quality of the original file (which is a PDF). Any tool (such as Par5E, NPC Engineer, Spell Engineer, etc.) is going to have the same issue: "garbage in, garbage out."

    You have to get the content cleaned up after extracting it from the PDF. That has nothing to do with FG, and everything to do with OCR. You can look at tools like these; I've never tried them.

    Problems? See: How to Report Issues, Bugs & Problems
    On Licensing & Distributing Community Content
    Community Contributions: Gemstones, 5E Quick Ref Decal, Adventure Module Creation, Dungeon Trinkets, Balance Disturbed, Dungeon Room Descriptions
    Note, I am not a SmiteWorks employee or representative, I'm just a user like you.

  9. #9
    JohnD's Avatar
    Join Date
    Mar 2012
    Johnstown ON
    You are at the mercy of whether or not the person who put the PDF together did a good job.
    "I am a Canadian, free to speak without fear, free to worship in my own way, free to stand for what I think right, free to oppose what I believe wrong, or free to choose those who shall govern my country. This heritage of freedom I pledge to uphold for myself and all mankind."

    - John Diefenbaker

    RIP Canada, February 21, 2022

  10. #10

    Join Date
    May 2016
    Jacksonville, FL
    There are tools such as the Author extension available on the forums (I have not yet tried it), or the Savage Worlds Extended Library (SWEL) which I have used, but mostly I just stick with Notepad++ because I have absolute control over the quality of the output that way.

    However, as @LordEntrails noted, regardless of how we accomplish things, we need quality source material to start with. Some PDFs are built in a downright amateurish and irresponsible manner, and the worst cases are where someone scanned or photographed pages and then released those as a PDF. Those are usually pirate copies, though occasionally you'll get official reprints of products so old they're out of print, where the dubious scan quality was the best the publisher could do in a bad situation.
