Software that converts binary files into plain text that a search engine can then index. For example, a filter can convert an MS Word file into text so it can be indexed by Lucene. Some filters can also convert into HTML for generic web viewing without a plugin.
Five Main Sources
- The 3 top commercial filters being used now are:
  - 1: Stellent
    - Now part of Oracle, which now sells search
  - 2: KeyView
    - Now part of Autonomy, a Tier-1 search vendor
  - 3: Microsoft IFilter Framework and utilities
    - Packaged as parts of other Microsoft products.
- And then two other general choices:
  - 4: Smaller Companies with Commercial Filters
    - We're trying to list them all here - send us feedback!
  - 5: Open Source projects
    - Some assembly required... If we've missed any links, please let us know!
Looking for the Old Guard?
For those of you who've been around the industry for a while, there are a couple of other names you might vaguely remember, but through mergers and acquisitions these old guard players are either gone, relabeled or otherwise subsumed into other offerings:
- INSO / OutsideIn
  - Now part of Oracle via the Stellent acquisition.
- Mastersoft
  - Now also part of Oracle, by a rather circuitous route: Mastersoft was bought by Frame, which was then bought by Adobe. The filters were then sold to Inso, and so they are now also part of Oracle via the Stellent acquisition.
Microsoft's IFilter Framework
Years ago Microsoft needed filters for one of its own early search engine efforts, for the original Search Server product and later for the Content Indexing Service. The early focus was on Microsoft Office related formats, but other formats have been added over the years by both Microsoft and 3rd party vendors. The formal IFilter Framework became well established when Microsoft used it in its own MSN Desktop Search offering.
Also, because Office 2007 documents are just compressed XML (ZIP archives of XML parts), and Microsoft has a tool that can automatically convert older Office documents into the newer XML format, there is another option from Redmond as well.
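To make the "compressed XML" point concrete, here is a minimal Python sketch that pulls plain text out of an Office 2007 Word file using only the standard library. It assumes the main body is stored at word/document.xml inside the ZIP package, which is the conventional location; a robust reader would follow the package relationships instead. The function names docx_to_text and make_sample_docx are hypothetical, and the sample package built here is a stripped-down stand-in for the demo, not a file Word itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the plain text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        # Assumption: the body lives at the conventional word/document.xml.
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each paragraph is split into runs; the text sits in <w:t> elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs):
    """Build a tiny in-memory package containing only word/document.xml."""
    body = "".join(
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraphs
    )
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", document)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    sample = make_sample_docx(["Hello filters", "Second paragraph"])
    print(docx_to_text(sample))
```

The text returned by docx_to_text could then be handed to whatever indexer you use. Headers, footers, footnotes and document properties live in other parts of the package and are ignored by this sketch.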
The basic filters are "free", since the DLLs ship with many Microsoft products. But free doesn't always mean "open source". We haven't lumped them in with open source filters because there is a real company maintaining them, and they are only shipped in binary form. In contrast, open source software includes the source code, which in fact is the "source" in open source.
The good news is that the filters do work and are widely available, and the price is right!
However, potential issues include:
- Complexity of integration. Using the filters with your software will require some investigation and some coding.
- Somewhat limited document format support. You'll need to check your requirements against the list of supported formats; in some cases 3rd party vendors may have the filter you need that will plug in to the IFilter framework.
- Tightly linked to the Windows operating system, which may or may not be an issue, depending on your OS requirements. We're not sure if there is any chance of Linux support for IFilters, although some .NET components do run on non-Windows operating systems.
- Licensing and distribution rights. We have not investigated this thoroughly, but the safest course of action might be to have customers separately install a Microsoft product that includes the filters, and then install your product. But we're not lawyers, so you'll need to do your own homework.
Delving into this subject further is beyond the scope of this article, but here are some links for the curious:
- http://channel9.msdn.com/wiki/default.aspx/Channel9.IFilter
- http://channel9.msdn.com/wiki/default.aspx/Channel9.DesktopSearchIFilters
- http://www.ifilter.org/
- http://www.citeknet.com/
- http://blogs.msdn.com/michkap/archive/2005/03/08/389675.aspx
Other Commercial and Open Source Filters
There are other options out there, both commercial and open source. But there doesn't seem to be any convenient direct replacement for Stellent or KeyView.
Disclaimer: The rest of this article is taken from my raw notes, so it is a bit terse, may be out of date, and possibly wrong.
Commercial
- Commercial / General
  - Davisor Offisor 4.1
    Java Based, converts many formats into XML
    http://www.davisor.com/offisor/index.html
    Filtrix??
  - Blueberry Software Filtrix
    http://www.blueberry.com
    To/from many publishing oriented formats; missing some Office formats such as Excel. Does handle FrameMaker MIF files. Can output to HTML. Windows and Solaris. Oddly, no support in/out for XML or PDF.
    http://www.blueberry.com/formats.htm
  - WordPort
    From acii.com (http://www.acii.com/wpt.htm), also claims to read various Multimate formats.
  - LogicTran
    http://www.logictran.com/index.html#r2net
  - YAWC Pro
    http://www.yawcpro.com/
  - Defunct? WvWare
    Filter for Microsoft Word documents (old)
    http://www.wvware.com
- Commercial Uni-Format
  - Antenna House to/from PDF/XML (Japan)
    Seems to only handle NEW XML-based Word format, WordML
    XML and PDF munging
    http://www.antennahouse.com/aboutus.htm
  - DocSoft Word to XML
    W2XML Word to XML, even DocBook.org format
    http://www.docsoft.com/w2xmlv2.htm
    W2XML may have been called Wordplay at some point in the past.
  - Infinity Loop upCast
    Microsoft Word only, to/from XML; upCast and downcast respectively.
    http://www.infinity-loop.de/products/upcast/index.html
    Some good technical info
    http://www.infinity-loop.de/products/upcast/support.html
- Commercial PDF
  - XPump
    NIE's own PDF to XML reader (part of XPump)
  - Some email based converter?
    http://preprints.cern.ch/Convert?emailGuide
  - $65 shareware
    http://www.sanface.com/jpg2pdf.html
  - Java Classes for PDF, free, "big faceless"
    http://big.faceless.org/products/pdf/index.jsp
Open Source
- Open Source / General
  - Antiword
    http://www.winfield.demon.nl/
  - Abiword command line conversion
    http://linuxhelp.blogspot.com/2005/08/use-abiword-to-convert-filetypes-on.html
    http://en.wikipedia.org/wiki/AbiWord
    http://www.abisource.com/
    http://portableapps.com/apps/office/word_processors/portable_abiword
  - Lius Lucene Index Update and Search
    http://sourceforge.net/projects/lius
- Open Source Office Formats
  - MS Word document format
    http://en.wikipedia.org/wiki/Microsoft_Word#File_formats
    http://visualbasic.about.com/od/learnvba/l/blecvbai0204.htm
  - Open Office developer page
    Article to convert from the command line
    http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html
    http://development.openoffice.org/
    http://wiki.services.openoffice.org/mwiki/index.php?title=Filter&action=edit
  - Jakarta POI Java API for Microsoft formats
    For OLE 2 Compound Document Format
    HWPF for Word Documents
    HPSF for Document Properties
    http://jakarta.apache.org/poi/
    http://jakarta.apache.org/poi/hwpf/index.html
  - Koffice Microsoft Word Filter KWord
    http://www.koffice.org/filters/1.4/kword/msword97.php
  - Doc2XML (in Python)
    http://pair.mbl.ca/doc2xml/
  - Nutch MS Word (Lucene)
    http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/msword/package-summary.html
    http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/msword/chp/package-summary.html
  - EgoThor Microsoft Office formats
    http://egothor.sourceforge.net/documentation/api/msft/
  - Open Office
    early on was xmerge to get away from licensing ... some relation to Inso
    Proposed http://xml.openoffice.org/xmerge/docs/XMerge_sdk.pdf
    For small devices http://xml.openoffice.org/xmerge/
    Open office java port for Mac
    http://neowiki.sixthcrusade.com/index.php/NeoOffice/J_File_Formats
  - Apache Office Format Project
    This is the leading candidate for OLE2 Compound Documents http://jakarta.apache.org/poi/
  - TEXT Mining (Word)
    http://www.textmining.org/modules.php?op=modload&name=Downloads&file=index&req=viewdownload&cid=2
    Part of Apache, uses some POI for some MS formats, better than straight POI
  - Suggested to also see code from wais and digg oxml tools
- Open Source HTML / XML
  - Nutch HTML to DOM
    http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/html/package-summary.html
  - Tidy / JTidy
    http://sourceforge.net/projects/jtidy
- Open Source PDF
  - Nutch PDF
    http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/package-summary.html
  - EgoThor PDF
    http://egothor.sourceforge.net/documentation/api/pdf/
  - another PDF
    http://www.pdfbox.org
  - XPDF and PDF2HTML
    - XPDF
      http://www.foolabs.com/xpdf/
    - PDF2HTML
      http://pdftohtml.sourceforge.net/
  - Jpedal
    http://www.jpedal.org/
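As an illustration of the "some assembly required" nature of the open source route, the AbiWord command line conversion listed under Open Source / General can be wrapped in a few lines of Python. This is a rough sketch, not a tested integration: it assumes an `abiword` executable on the PATH that accepts the `--to=FORMAT` option, and that the converted file is written next to the input with the extension swapped. Check both against your installed version. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_abiword_command(input_path, target_format="txt"):
    """Return the argument list for an AbiWord batch conversion."""
    return ["abiword", f"--to={target_format}", str(input_path)]


def convert_with_abiword(input_path, target_format="txt", timeout=60):
    """Convert a document with AbiWord; return the expected output path.

    Returns None when no `abiword` executable is found on the PATH.
    Raises subprocess.CalledProcessError if the conversion fails.
    """
    if shutil.which("abiword") is None:
        return None
    subprocess.run(
        build_abiword_command(input_path, target_format),
        check=True,
        timeout=timeout,
    )
    # Assumption: the output lands beside the input, with the extension
    # replaced by the target format.
    return Path(input_path).with_suffix(f".{target_format}")
```

A call such as convert_with_abiword("report.doc") would then, under the assumptions above, leave report.txt ready for indexing. Running one process per document is slow for large collections, so a library-based filter may be a better fit when volume matters.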
Excerpted with permission from NIE article "Where have all the Filters gone?".