
BLU Discuss list archive



Document scanning and "searchable" PDFs



My first reply went off-list.

On Sat, May 9, 2009 at 3:34 PM, David Vasconcelos wrote:

> I've got an HP OfficeJet 5610 and I'm interested in finding a
> replacement for the bundled scanner software.
>
> The HP software (Windows) can produce something it calls a "searchable
> PDF."  I really like this format because it combines an image of the
> document with OCR'd text.
>
> The text gets embedded in such a way that you can select/copy text
> directly from acroread, evince, etc.
>
> I've tried gscan2pdf and it comes pretty close to what I'm looking for.
>  However...
>
> 1. The OCR'd text gets embedded differently, so you can't actually
> select/copy the OCR'd text from a PDF viewer.
>
> 2. The OCR back-ends for gscan2pdf (Tesseract and GOCR) seem to have
> trouble with multiple columns of text, or things like pay-stubs where
> the text doesn't flow in paragraphs.  The free HP software seems to
> handle this without a problem.
>
> So, I've been scanning from Windows.  I'd really like to find an
> alternative.
>
> Any suggestions?  Thanks!
>



I've only used XSane or Kooka http://kooka.kde.org/ with the normal OCR
engines.  And it has been a long time since I scanned anything.

While reading this review, http://groundstate.ca/ocr , I learned about
OCRopus, which looks very interesting:

http://code.google.com/p/ocropus/
http://sites.google.com/site/ocropus/install-mercurial


This review http://www.linux.com/archive/articles/57222 explains how
Tesseract (http://code.google.com/p/tesseract-ocr/), originally from HP and
now maintained by Google, changed the landscape by providing high accuracy,
but I think it has since been either incorporated into or superseded by
OCRopus.
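
As a rough illustration of the "image plus OCR'd text" idea: as far as I
know, later Tesseract releases (3.03 and newer) can write a searchable PDF
themselves when the `pdf` output config is given on the command line. The
sketch below just wraps that call from Python. The function names and the
file names are made up for illustration, and it assumes a new enough
`tesseract` binary is on the PATH; I have not checked how well it copes with
multi-column pages or pay-stubs.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Return the argv list for a Tesseract run that writes <output_base>.pdf.

    Hypothetical helper: the trailing "pdf" argument selects Tesseract's
    PDF output config, which embeds the recognized text behind the image.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on one scanned image and return the resulting PDF path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    command = build_tesseract_pdf_command(image_path, output_base, lang)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(str(output_base) + ".pdf")
```

Calling `make_searchable_pdf("scan.png", "scan")` on a scanned page should
leave a `scan.pdf` next to it, assuming the OCR run succeeds.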

There are commercial applications that run on Linux/Unix, but they cost
thousands of dollars:
http://vividata.com/be_xtr_pricing.html


Please let us know what else you find out.

~ Greg

-- 
Greg Rundlett
Web Developer - Initiative in Innovative Computing
http://iic.harvard.edu
camb 617-384-5872
nbpt 978-225-8302
m. 978-764-4424
-skype/aim/irc/twitter freephile
http://profiles.aim.com/freephile






BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org