BLU Discuss list archive


Fw: What laser printers do you like - Ricoh & Linux



James R. Van Zandt wrote:
> I have put together a sizable collection of IEEE papers, but they're
> image-only PDFs, making them hard to search.
> 
> Is there a convenient way to add the metadata to the PDF files
> themselves, along with (say) a hand-typed abstract and OCR of the
> rest, so the whole thing can be indexed by something like beagle
> <http://beaglewiki.org/Main_Page>?  
> 
>               - Jim Van Zandt

I would start by running pdftotext on them, then using regular
expressions to pull metadata out of the text versions.
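A rough sketch of that approach in Python, offered as an illustration and not a finished tool. It assumes the pdftotext command-line program is installed, and the function names and regular expressions (first non-blank line as title, a DOI-shaped string, a paragraph introduced by "Abstract:") are hypothetical guesses that would need tuning against real papers. Note that pdftotext only extracts an existing text layer, so image-only scans would need an OCR pass before this step yields anything.

```python
import re
import subprocess


def pdf_to_text(path):
    """Return the text layer of a PDF by calling the pdftotext CLI."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical patterns; real papers vary and will need adjusted expressions.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ABSTRACT_RE = re.compile(
    r"Abstract\s*[:\-\u2014]\s*(.+?)(?:\n\s*\n|\Z)", re.DOTALL
)


def extract_metadata(text):
    """Pull a few metadata fields out of plain text with regular expressions."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    doi = DOI_RE.search(text)
    abstract = ABSTRACT_RE.search(text)
    return {
        # Assumption: the first non-blank line is the title.
        "title": lines[0] if lines else None,
        "doi": doi.group(0) if doi else None,
        # Collapse line breaks inside the abstract into single spaces.
        "abstract": " ".join(abstract.group(1).split()) if abstract else None,
    }
```

Something like `extract_metadata(pdf_to_text("paper.pdf"))` would then give a dictionary that could be written to a sidecar file or fed to an indexer; the file name here is only a placeholder.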

Oddly enough, this is the basis of one of the projects I'm working on at
Aptima: pulling metadata from information that arrives from many sources
in many formats, tracking that metadata, and grouping documents by it.




BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.

Boston Linux & Unix / webmaster@blu.org