Search Components

March 30, 2009

CATMaker: Summarization Tool for Medical Text

CATMaker is an interesting tool for medical text. It helps generate a summary of a case and a "bottom line" assessment.
http://www.cebm.net/index.aspx?o=1216

It's produced by the Center for Evidence-Based Medicine (CEBM). It might be open source.

Summary tools can be very helpful when used with search engines.  They can help give a better summary in a results list, or when used at index time, give the search engine a smaller, more relevant set of text to search over.
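
To make the index-time idea concrete, here is a minimal sketch of the simplest possible summarizer: keep only the first few sentences of a document before handing it to the indexer. This is not how CATMaker works; the class name LeadSummary and the "first N sentences" rule are assumptions for illustration, and a real summary tool would pick sentences far more carefully.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: a naive "lead sentences" summary, for illustration only.
final class LeadSummary {

    // Returns at most maxSentences sentences from the start of the text.
    static String summarize(String text, int maxSentences) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);

        StringBuilder summary = new StringBuilder();
        int start = sentences.first();
        int kept = 0;
        // Walk sentence boundaries until we run out of text or hit the limit.
        for (int end = sentences.next();
             end != BreakIterator.DONE && kept < maxSentences;
             start = end, end = sentences.next()) {
            summary.append(text, start, end);
            kept++;
        }
        return summary.toString().trim();
    }
}
```

The shortened text could go into its own index field, or be shown as the snippet in a results list, while the full document is stored separately.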

March 24, 2009

Java HTML Parsers

Before you index your content into a search engine, you've gotta parse it.

Since many HTML pages either don't conform to the strict XHTML standards or have blatant typos, you typically can't use JDOM or other XML tools to parse them. And there are some HTML pages that even JTidy can't fix.

Here are some other resources:

Open source list of HTML Parsers
http://java-source.net/open-source/html-parsers

Java's own HTML parser (surprisingly, part of the Swing UI kit)

And one related link with a ton of examples speaking HTTP
http://jan.newmarch.name/distjava/http/lecture2.html
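
As a rough sketch of the Swing option in the list above, the JDK's javax.swing.text.html.parser.ParserDelegator can pull the text out of sloppy HTML without any extra libraries. The class name HtmlTextExtractor and the choice to join text chunks with single spaces are assumptions for illustration; a production indexer would likely also want titles, links and meta tags.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: extracts text from HTML using the parser bundled with Swing.
final class HtmlTextExtractor {

    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser reports each run of character data through handleText.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared in the page; we already have characters.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling HtmlTextExtractor.extractText on a page with unclosed tags should still return the visible words, which is the part a search engine usually cares about.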

March 15, 2009

Open Source Word Lists

http://wordlist.sourceforge.net

Including variations for spelling correction, inflection, British vs. American English, parts of speech, jargon and the "12 dictionaries" package.

March 11, 2009

WordNet: An Open Source Thesaurus

The fine folks at Princeton have an online database of synonyms that can also be used with your own code.  This is pretty amazing, given how carefully guarded most of the traditional thesaurus providers have been.

See the main site, the very generous license (free for commercial use and modification under reasonable conditions), doc, stats and download.

Of course for a search application you'll need to actually do something with it.  There is a WordNet analyzer listed in the Lucene sandbox, though the link is broken.  But we have it on good authority that it does work...

Here's an article on stemming and lemmatization that includes info on thesaurus based searching, among other things.
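
One common way to "do something with it" is query expansion: replace each search term with an OR group of its synonyms. The sketch below is not the Lucene sandbox analyzer and does not read the WordNet files; the class SynonymQueryExpander and its hand-loaded synonym groups are hypothetical stand-ins for data you would load from the WordNet download.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: expands query terms using an in-memory synonym table.
final class SynonymQueryExpander {

    private final Map<String, Set<String>> synonyms = new HashMap<>();

    // Registers a group of interchangeable terms, e.g. one WordNet synset.
    void addSynonymGroup(String... terms) {
        Set<String> group = new LinkedHashSet<>();
        for (String term : terms) {
            group.add(term.toLowerCase(Locale.ROOT));
        }
        for (String term : group) {
            synonyms.computeIfAbsent(term, k -> new LinkedHashSet<>()).addAll(group);
        }
    }

    // Turns "fast car" into "fast AND (car OR automobile OR auto)".
    String expand(String query) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(" AND ");
            }
            // Keep the original term first, then add any known synonyms.
            Set<String> alternatives = new LinkedHashSet<>();
            alternatives.add(term);
            alternatives.addAll(synonyms.getOrDefault(term, Collections.<String>emptySet()));
            if (alternatives.size() == 1) {
                out.append(term);
            } else {
                out.append('(').append(String.join(" OR ", alternatives)).append(')');
            }
        }
        return out.toString();
    }
}
```

Expanding at query time keeps the index small and lets you change the thesaurus without reindexing; expanding at index time trades a larger index for faster queries.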

February 03, 2009

Diagnose your own Search Problems with a genuine Dr. Search Mug

The lovable and informative Dr. Search has gone all Hollywood now with a line of mugs and mouse pads. Perfect for that busy SCOE team lead.

January 23, 2009

E-Retailing / eTailing / eCommerce Vendors Directory

The fine folks at Internet Retailer have a really nice list of vendors. There are specific sections for Search Engine Marketing and Site Search Solutions.

http://www.internetretailer.com/E-Commerce/vendor_list.asp

January 13, 2009

PDF Box - Text Extraction and Highlighting

PDFBox is an API for extracting and highlighting text from Adobe Acrobat documents (AKA PDF files).

Text extraction: (from their site)

PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities.

PDF Highlighting (from their site)

There are cases when you might want to highlight text in a PDF document. For example, if the PDF is the result of a search request you might want to highlight the word in the resulting PDF document. There are several ways this can be achieved, each method varying in complexity and flexibility.

http://www.pdfbox.org/userguide/highlighting.html

December 15, 2008

Predefined Database Models for Various KM Systems

One of our readers points out a set of interesting database models. For example, #90 is for Science and Research applications and handles various metadata fields, and #94 is a template for searching Newspaper Articles.

The main list of database models is at:
http://www.databaseanswers.org/data_models/

December 03, 2008

Source Code Search Engine: Krugle

Krugle actually searches through your Java, C and C# source code, inside all of your company's source code control systems (like Visual Source Safe, CVS, Subversion, etc.), to help programmers find specific or similar pieces of code. The idea is to encourage coders to reuse existing code instead of reinventing the wheel time after time. They have native parsers for dozens of programming languages, and the searcher can adjust how exact a match they want, with Krugle understanding things like for-loops and if-then-else constructs from many languages. We saw their pitch at ESS West and thought it looked pretty useful, especially for larger coding shops.

From their site:

Krugle Enterprise creates a comprehensive, searchable library of all the source code and related information in your organization. It provides answers to costly code maintenance and development problems previously unsolvable because of information boundaries around source code.

Krugle Enterprise eliminates unwanted code duplication and makes developers more proficient with existing code. This results in significant time to market, quality and cost advantages.

http://www.krugle.com

November 24, 2008

RedDot includes Query Language API for its CMS

RedDot offers a Content Management System and related tools. Their Developer Page includes information on their RedDot Query Language. From their site:

RedDot offers a well documented open API (Application Programmers Interface) to develop additional functionality RedDot. Utilize the number of different tools and extensions to enhance the functionality of the core system, partially created and designed by RedDot partners. The RedDot Query Language (RQL) offers a standard method to add to the main functionality of your Content Management project.

http://www.reddot.com/products_development_tools_reddot_query_language.htm
