
BLU Discuss list archive



Fw: What laser printers do you like - Ricoh & Linux



David Kramer wrote:
> James R. Van Zandt wrote:
>> I have put together a sizable collection of IEEE papers, but they're
>> image-only PDFs, making them hard to search.
>>
>> Is there a convenient way to add the metadata to the PDF files
>> themselves, along with (say) a hand-typed abstract and OCR of the
>> rest, so the whole thing can be indexed by something like beagle
>> <http://beaglewiki.org/Main_Page>?  
>>
>>               - Jim Van Zandt
> 
> I would start by running pdftotext on them, then using regular
> expressions to pull metadata out of the text versions.
> 
> Oddly enough, this is the basis of one of the projects I'm working on at
>  Aptima.  Pulling metadata from information coming from many sources in
> many formats, tracking the metadata, and grouping documents into that
> metadata.
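
For PDFs that do carry a text layer, the pdftotext-plus-regex idea quoted above might look roughly like the sketch below. It is only an illustration: the function name `extract_metadata`, the "first non-blank line is the title" rule, and the "Abstract" pattern are all assumptions about how a typical paper is laid out, not something taken from the thread, and real papers will need per-publisher tweaks.

```python
import re

# Hypothetical pattern: the word "Abstract", an optional separator, then
# everything up to the first blank line (or the end of the text).
ABSTRACT_RE = re.compile(
    r"\bAbstract\b\s*[-:.]?\s*(?P<abstract>.+?)(?:\n\s*\n|\Z)",
    re.DOTALL | re.IGNORECASE,
)


def extract_metadata(text):
    """Pull a rough title and abstract out of pdftotext-style output.

    Assumption: the title is the first non-blank line of the text.
    Returns a dict whose values are None when nothing was found.
    """
    title = None
    for line in text.splitlines():
        if line.strip():
            title = line.strip()
            break

    abstract = None
    match = ABSTRACT_RE.search(text)
    if match:
        # Collapse line breaks and runs of spaces into single spaces.
        abstract = " ".join(match.group("abstract").split())

    return {"title": title, "abstract": abstract}
```

The input would be whatever `pdftotext paper.pdf -` prints; anything the patterns miss still has to be typed in by hand.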

These are image-only PDFs; each page of the PDF is simply a big image.
pdftotext won't find any text in them.
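
A quick way to sort a collection into "has text" and "image-only" is to run pdftotext on each file and see whether anything besides whitespace and page breaks comes back. The following is a minimal sketch, assuming pdftotext (from poppler or xpdf) is on the PATH; the helper names and the 20-character threshold are made up for illustration.

```python
import subprocess


def extract_text(pdf_path):
    """Run `pdftotext <file> -` and return whatever it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(extracted, min_chars=20):
    """Heuristic: image-only PDFs yield little more than form feeds.

    `min_chars` is an arbitrary cutoff (an assumption), counted after
    dropping all whitespace, including the page-break form feeds.
    """
    return len("".join(extracted.split())) >= min_chars
```

Files for which `has_text_layer(extract_text(path))` is false are the ones that would need OCR before any indexing can work.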

I haven't found a good OCR solution for Linux. I have an adequate OCR
package for Windows, but I don't see any way to automate it; each
document has to be processed by hand. The results are good enough to
serve as rough metadata, but you'd still need to review and correct a
non-trivial amount of the resulting text to get decent
searchability.


-- 
John Abreau
IT Manager
Zuken USA
238 Littleton Rd., Suite 100
Westford, MA 01886
T: 978-392-1777            F: 978-692-4725
M: 978-764-8934
E: John.Abreau at zuken.com  W: www.zuken.com




BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org