Before you index your content into a search engine, you've gotta parse it.
Since many HTML pages either don't conform to the strict XHTML standard or contain blatant typos, you typically can't use JDOM or other XML tools to edit them. And there are some HTML pages that even JTidy can't fix.
Here are some other resources:
Open source list of HTML Parsers
http://java-source.net/open-source/html-parsers
Java's own HTML parser (surprisingly, part of the Swing UI kit)
- javax.swing.text.html.HTMLEditorKit
- http://www.java2s.com/Code/Java/Development-Class/HTMLDocumentElementIteratorExample.htm
- http://www.java2s.com/Tutorial/Java/0120__Development/UsejavaxswingtexthtmlHTMLEditorKittoparseHTML.htm
- http://www.java2s.com/Tutorial/Java/0120__Development/ParseHTML.htm
- http://www.rgagnon.com/javadetails/java-0424.html
- http://www.rgagnon.com/javadetails/java-0639.html
- http://forums.java.net/jive/message.jspa?messageID=58806
- http://java.sun.com/products/jfc/tsc/articles/bookmarks/
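To give a feel for the Swing parser listed above, here is a minimal sketch that pulls the text and link targets out of a page. The class and method names (SwingHtmlText, Collector, parse) are made up for this example; only HTMLEditorKit.ParserCallback and ParserDelegator come from the JDK. It assumes you already have the HTML as a string, and it is a starting point rather than a complete extractor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingHtmlText {

    /** Hypothetical helper: collects text chunks and link targets while parsing. */
    static class Collector extends HTMLEditorKit.ParserCallback {
        final List<String> text = new ArrayList<>();
        final List<String> links = new ArrayList<>();

        @Override
        public void handleText(char[] data, int pos) {
            // Called once per run of character data between tags.
            text.add(new String(data));
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Record the href of every <a> start tag that has one.
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }
    }

    /** Parses an HTML string with the JDK's built-in parser and returns what was collected. */
    static Collector parse(String html) {
        Collector collector = new Collector();
        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(reader, collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }

    public static void main(String[] args) {
        // Deliberately sloppy input: the second <p> and the <html> element are never closed.
        Collector c = parse(
            "<html><body><p>Hello <a href=\"http://example.com/\">world</a><p>unclosed</body>");
        System.out.println(c.text);
        System.out.println(c.links);
    }
}
```

The callback style means you never build a full document tree, which tends to suit a spider that only wants the text and the outgoing links.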
And one related link with a ton of examples of speaking HTTP
http://jan.newmarch.name/distjava/http/lecture2.html
FreewareFiles.com
shows simply how to remove a word using the replace method (based on our favorite, regex), while his second post has a great little function for searching for text that uses a recursive search for nodes that contain a pattern. It includes a cool demo of its capabilities as a link right under the code listing.
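The two ideas mentioned here, removing a word with a regex-based replace and recursively searching nodes for a pattern, can be sketched in Java roughly as follows. This is not the code from those posts. It is an illustration using the JDK's org.w3c.dom classes, it assumes the markup is already well-formed (for example after a cleanup pass), and the names NodeSearch, removeWord, findText and search are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeSearch {

    /** Removes every whole-word occurrence of word, plus any whitespace right after it. */
    static String removeWord(String s, String word) {
        return s.replaceAll("\\b" + Pattern.quote(word) + "\\b\\s*", "");
    }

    /** Recursively walks the tree and collects text nodes whose content matches the pattern. */
    static void findText(Node node, Pattern pattern, List<Node> hits) {
        if (node.getNodeType() == Node.TEXT_NODE
                && pattern.matcher(node.getNodeValue()).find()) {
            hits.add(node);
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            findText(children.item(i), pattern, hits);
        }
    }

    /** Parses well-formed markup and returns the matching text nodes in document order. */
    static List<Node> search(String xhtml, String regex) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<Node> hits = new ArrayList<>();
            findText(doc.getDocumentElement(), Pattern.compile(regex), hits);
            return hits;
        } catch (Exception e) {
            // Assumption: callers only pass well-formed markup; anything else is rejected.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<div><p>parse the page</p><p>index <b>the parsed</b> text</p></div>";
        for (Node n : search(page, "pars(e|ed)")) {
            System.out.println(n.getParentNode().getNodeName() + ": " + n.getNodeValue());
        }
        System.out.println(removeWord("remove the bad word", "bad"));
    }
}
```

Matching on text nodes rather than on the raw markup keeps the pattern from accidentally hitting tag names or attribute values.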
The 

These folks are offering quite a few free and low-cost file conversion tools for PDF, which would be handy if you were writing a web spider that needed to handle PDF.