
BLU Discuss list archive



Document scanning and "searchable" PDFs



I've got an HP OfficeJet 5610 and I'm interested in finding a
replacement for the bundled scanner software.

The HP software (Windows) can produce something it calls a "searchable
PDF."  I really like this format because it combines an image of the
document with OCR'd text.

The text gets embedded in such a way that you can select/copy text
directly from acroread, evince, etc.
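For illustration, here is a minimal sketch of how such a page is commonly laid out inside a PDF: the scanned image is painted first, then the OCR'd words are drawn on top in text rendering mode 3 (invisible), so a viewer can select them without anything being drawn over the image. This is an assumption about the general technique, not a description of what the HP software actually emits. The function names, the `Im0` and `F1` resource names, and the `(x, y, font_size, text)` word tuples are hypothetical.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_content_stream(width, height, words, image_name="Im0", font_name="F1"):
    """Build a page content stream: scanned image first, invisible text on top.

    words is a list of (x, y, font_size, text) tuples in PDF points,
    as a hypothetical OCR engine might report them.
    """
    ops = [
        "q",
        f"{width} 0 0 {height} 0 0 cm",  # scale the unit-square image to the page
        f"/{image_name} Do",             # paint the scanned image
        "Q",
        "BT",
        "3 Tr",                          # text rendering mode 3: invisible text
    ]
    for x, y, size, text in words:
        ops.append(f"/{font_name} {size} Tf")          # select font and size
        ops.append(f"1 0 0 1 {x} {y} Tm")              # position the word
        ops.append(f"({escape_pdf_string(text)}) Tj")  # emit the word
    ops.append("ET")
    return "\n".join(ops)


if __name__ == "__main__":
    print(page_content_stream(612, 792, [(72, 700, 11, "Net pay (USD)")]))
```

This only builds the content stream. A complete file would also need the image and font objects, the page tree, and a cross-reference table.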

I've tried gscan2pdf and it comes pretty close to what I'm looking for.
However...

1. The OCR'd text gets embedded differently, so you can't actually
select/copy the OCR'd text from a PDF viewer.

2. The OCR back-ends for gscan2pdf (Tesseract and GOCR) seem to have
trouble with multiple columns of text, or things like pay-stubs where
the text doesn't flow in paragraphs.  The free HP software seems to
handle this without a problem.

So, I've been scanning from Windows.  I'd really like to find an
alternative.

Any suggestions?  Thanks!

David





