Before you index your content into a search engine, you've gotta parse it.
Since many HTML pages either don't conform to the strict XHTML standard or contain blatant typos, you typically can't use JDOM or other XML tools to edit them. And there are some HTML pages that even JTidy can't fix.
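To give a feel for what forgiving HTML handling looks like, here's a minimal sketch using the lenient parser from Python's standard library rather than JDOM or JTidy. The names TextExtractor and extract_text are made up for this illustration, and the code only pulls out visible text for indexing; it makes no attempt to repair the markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, even if tags are unclosed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_text on a page with unclosed tags still returns the text, where a strict XML parser would reject the input outright.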
The fine folks at Princeton have an online database of synonyms, WordNet, that can also be used with your own code. This is pretty amazing, given how carefully guarded most of the traditional thesaurus providers have been.
Of course, for a search application you'll need to actually do something with it. There is a WordNet analyzer listed in the Lucene sandbox, though the link is broken. But we have it on good authority that it does work...
Here's an article on stemming and lemmatization that includes info on thesaurus-based searching, among other things.
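As a rough idea of what "doing something with it" might mean, here's a small sketch of query-time synonym expansion. The SYNONYMS table is a hand-made stand-in, not real WordNet data, and expand_query is a hypothetical helper; the Lucene sandbox analyzer may well work quite differently.

```python
# Hypothetical, hand-made synonym table standing in for a real thesaurus.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "repair": ["fix", "mend"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Turn 'car repair' into a boolean query that also matches synonyms."""
    groups = []
    for term in query.lower().split():
        alternatives = [term]
        for synonym in synonyms.get(term, []):
            if synonym not in alternatives:
                alternatives.append(synonym)
        if len(alternatives) == 1:
            # No known synonyms: keep the term as-is.
            groups.append(term)
        else:
            groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)
```

The same table could be applied at index time instead, which makes the index larger but keeps queries simple.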
CATMaker is an interesting tool for medical text. It helps generate a summary of a case and a "bottom line" assessment. http://www.cebm.net/index.aspx?o=1216
Summary tools can be very helpful when used with search engines. They can help give a better summary in a results list, or when used at index time, give the search engine a smaller, more relevant set of text to search over.
PDFBox is an API for extracting and highlighting text from Adobe Acrobat documents (AKA PDF files).
Text extraction: (from their site)
PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities.
There are cases when you might want to highlight text in a PDF document. For example, if the PDF is the result of a search request you might want to highlight the word in the resulting PDF document. There are several ways this can be achieved, each method varying in complexity and flexibility.
The good folks at Parse-O-Matic are offering a free tool for file conversions that includes a powerful scripting language that enables you to set up automatic conversions for any number of files and let the tool do its work while you move on to other tasks - or just go home early! Check out the sample script language.
Is your file in the wrong format? Instead of rekeying it, reformat it with Parse-O-Matic Free Edition — a flexible, programmable data file converter. Avoid the frustrating restrictions of "point and click" converters that almost do the job; with the Parse-O-Matic Free Edition, your scripts tell the program precisely what you want to do.
Sample applications: Edit a text file automatically. Copy valid data; repair or skip bad data. Rearrange a print file. Expedite migration of legacy systems. Export a comma-separated-value (CSV) file for import to a database (such as Access, SQL Server, FileMaker, Oracle, Paradox, ODBC) or a spreadsheet table (Excel, Calc, Quattro). Select and correct data from a mailing or customer list (first name, last name, street address, city, phone number and so on). Generate mail merge sets. Modify character strings into uppercase, lowercase or mixed case. Calculate totals up to 18 digits long.
Planning a data warehouse system? You can split or reorganize files per your specification (regular expressions supported), or mine printed reports for essential information.
Input formats: Read, extract, analyze and reorganize data fields from flat files such as: text (ASCII from Windows, Unix/Linux or Mac, EBCDIC from a mainframe, plus log files from web servers, process control devices and scientific instruments); binary ("hex"); fixed length and variable length records; tab/null/comma-delimited; Windows clipboard.
Working with large files? The Parse-O-Matic Free Edition can convert, filter and transform files of almost any size. If you have enough room on your hard disk to copy it, then you can probably parse it.
Output formats: Almost any record or file format, including HTML and XML. You can also write text to the Windows clipboard so it can be directly pasted into other applications.
FreewareFiles.com has a nice PDF text extraction tool. From the description:
Ease Pdf to Text Extractor is a free software designed to extract text from Adobe PDF files. It does NOT need Adobe Acrobat software. It processes at very high speed and you can convert multiple PDF files to text files at one time. The program is freeware, which means that you can use it either personally or commercially for free.
With this tool you can allow your website visitors to view binary files like MS Word and PDF on your site without them needing to download and launch a viewer. This could be handy, for example, if your documentation is in PDF but people complain about needing to launch the Adobe viewer.
These folks are offering quite a few free and low cost file conversion tools for PDF, which would be handy to have if you were writing a web spider that needed to handle PDF. See http://pdf-to-html-word.com/download.htm for the download page.
Microsoft's newer document formats from Office 2007, ending in .docx, .xlsx, .pptx, etc., are actually just compressed (zipped) collections of XML files. You can take any of those files and unzip it, which will give you a subdirectory. In that subdirectory there are good old XML files, one of which will contain the text of your document. At this point you can use one of the tools for handling XML, such as XSLT, to feed the text into your favorite search engine.
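Here's a minimal sketch of that unzip-and-read approach for a .docx file, using only Python's standard library. It assumes the main text lives in word/document.xml, which is the conventional location but not guaranteed for every file, and extract_docx_text is a name invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(source):
    """Return the text of a .docx (path or file object), one paragraph per line."""
    with zipfile.ZipFile(source) as archive:
        # Assumption: the main document part is at the conventional path.
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        if runs:
            paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

This ignores headers, footers and footnotes, which live in other XML files inside the archive, and text boxes nested inside a paragraph may show up twice. For indexing body text it's often a good enough start.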
That's all good, but what about older Office 2003 and earlier files, in their proprietary Microsoft formats?
These documents can now be automatically converted into Office 2007 format with Microsoft's OMPM (Office Migration Planning Manager) tool, which means they can be converted into compressed XML, which again opens them up to relatively easy processing.