anotherbyte.net
another byte

Extracting Plain Text for Indexing

Published: 16 Jun 2008
Updated: 16 May 2009

Searching a collection of documents by keyword requires an index, unless you are prepared to scan every document at query time.

An index is built from plain text, but many documents are stored in formats that are not plain text, PDF in particular.
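To make the dependency on plain text concrete, here is a minimal sketch of a keyword index (an inverted index) in Python. It is not taken from the original post: the function names and the two sample documents are hypothetical, and it assumes the text has already been extracted.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, keyword):
    """Return the sorted names of the documents containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical documents, already reduced to plain text.
docs = {
    "a.txt": "Extracting plain text from PDF",
    "b.txt": "Plain text is easy to index",
}
index = build_index(docs)
print(search(index, "plain"))  # ['a.txt', 'b.txt']
print(search(index, "pdf"))    # ['a.txt']
```

The index only ever sees strings, which is why a binary format such as PDF has to be converted to text before it can be searched this way.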

Here are some ways to extract plain (possibly formatted) text from a PDF document:

There are also plenty of shareware and strictly commercial products out there.
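As one free illustration of the extraction step, the sketch below calls the pdftotext command-line tool (shipped with Xpdf and Poppler) from Python. Whether pdftotext was among the tools the post originally listed is not known, so treat the choice as an assumption; the helper names are hypothetical. The -layout flag asks pdftotext to preserve the page layout, which roughly corresponds to the "possibly formatted" text mentioned above.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # Keep the physical layout (columns, indentation) where possible.
        command.append("-layout")
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext on one PDF and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With a hypothetical file name, extract_text("report.pdf", keep_layout=True) would return the text of that file, ready to be passed to an indexer. Scanned PDFs that contain only images yield little or no text this way and need OCR instead.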
