Extracting Plain Text for Indexing
Keyword search needs an index, unless you are willing to scan every document at query time.
An index is built from plain text, and many common formats are not plain text, PDF in particular.
Here are some ways to extract plain (possibly formatted) text from a PDF document:
- CodeProject: using C, reading the PDF directly according to the PDF standard.
- CodeProject: using C# with iTextSharp.
- Text Mining Tool: A command line tool (http://text-mining-tool.com/) which can extract plain text from a number of file formats.
- PDFBox: an open source Java library.
There are also plenty of shareware and strictly commercial products out there.
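To show why the index needs plain text in the first place, here is a minimal sketch of a keyword index in Python. It assumes the text has already been extracted by one of the tools above. The document names, the sample text, and the function names (tokenize, build_index, search) are all made up for illustration, and a real indexer would likely also handle stemming, stop words, and ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, keyword):
    """Return the sorted names of documents containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical plain text, as it might come out of an extraction tool.
documents = {
    "report.pdf": "Quarterly sales report for the northern region",
    "notes.txt": "Meeting notes about the sales forecast",
}

index = build_index(documents)
print(search(index, "sales"))   # ['notes.txt', 'report.pdf']
print(search(index, "region"))  # ['report.pdf']
```

The index is built once, so each lookup is a dictionary access instead of a scan over every document, which is the trade-off against searching dynamically.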

