Before you index your content into a search engine, you've gotta parse it.
Since many HTML pages either don't conform to the strict XHTML standard or contain blatant typos, you typically can't use JDOM or other XML tools to process them. And there are some HTML pages that even JTidy can't fix.
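To make the problem concrete, here is a minimal sketch that feeds the same sloppy markup to a strict XML parser and to a forgiving HTML parser. It uses only classes that ship with the JDK (the JAXP `DocumentBuilder` and Swing's `ParserDelegator`), not JDOM or JTidy. The class name `LenientHtmlDemo`, the helper names, and the sample markup are made up for illustration, and a real spider would likely want a more robust parser than the Swing one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names here are illustrative only.
class LenientHtmlDemo {

    // Returns true only if a strict XML parser accepts the markup.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects the text content of an HTML page with the JDK's lenient parser.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Unclosed <b> and <p> tags, no closing </html>: common in the wild.
        String sloppy = "<html><body><p>Hello <b>world<p>second paragraph</body>";
        System.out.println("Strict XML parser accepts it: " + isWellFormedXml(sloppy));
        System.out.println("Text from lenient parser: " + extractText(sloppy));
    }
}
```

The strict parser rejects the page outright, while the lenient one still hands back the text you would want to index. That is the trade-off: forgiving parsers guess at the intended structure, so they are fine for pulling out text but less trustworthy for anything that depends on exact nesting.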
These folks offer quite a few free and low-cost file conversion tools for PDF, which would be handy to have if you were writing a web spider that needed to handle PDF. See their download page at http://pdf-to-html-word.com/download.htm