Before you index your content into a search engine, you've gotta parse it.
Since many HTML pages either don't conform to the strict XHTML standard or contain blatant typos, you typically can't use JDOM or other XML tools to parse or edit them. And there are some HTML pages that even JTidy can't fix.
Here are some other resources:
Open source list of HTML Parsers
http://java-source.net/open-source/html-parsers
Java's own HTML parser (surprisingly, part of the Swing UI kit)
- javax.swing.text.html.HTMLEditorKit
- http://www.java2s.com/Code/Java/Development-Class/HTMLDocumentElementIteratorExample.htm
- http://www.java2s.com/Tutorial/Java/0120__Development/UsejavaxswingtexthtmlHTMLEditorKittoparseHTML.htm
- http://www.java2s.com/Tutorial/Java/0120__Development/ParseHTML.htm
- http://www.rgagnon.com/javadetails/java-0424.html
- http://www.rgagnon.com/javadetails/java-0639.html
- http://forums.java.net/jive/message.jspa?messageID=58806
- http://java.sun.com/products/jfc/tsc/articles/bookmarks/
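To give a feel for the Swing parser listed above, here is a minimal sketch that pulls the visible text out of a sloppy HTML string. It relies only on the JDK's `ParserDelegator` and `HTMLEditorKit.ParserCallback`; the class name `HtmlTextExtractor`, the method `extractText`, and the sample markup are made up for illustration. Treat it as a starting point rather than a complete indexing pipeline: it ignores character encodings, attributes, and links.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class (not from the original post): collects the text
// chunks that the Swing HTML parser reports while it walks the markup.
class HtmlTextExtractor {

    static String extractText(String html) throws IOException {
        StringBuilder text = new StringBuilder();

        // The parser calls handleText() once for each run of character data.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // keep adjacent chunks apart
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared in the page, since we
            // already have decoded characters rather than raw bytes.
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Deliberately sloppy markup: unclosed <b>, <p>, <body> and <html>.
        String sloppy = "<html><body><p>Hello <b>world<p>Unclosed tags";
        System.out.println(extractText(sloppy));
    }
}
```

The same callback class also has `handleStartTag`, `handleEndTag`, and `handleSimpleTag` methods you can override if you need tag names or attributes (for example, to collect `href` values) in addition to the text.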
And one related link with a ton of examples of speaking HTTP:
http://jan.newmarch.name/distjava/http/lecture2.html