We've recently discovered PDFMiner, a PDF parser and analyzer written entirely in Python. PDFMiner is actually a suite of tools that includes a parser, a text renderer, and utilities for extracting text. Yusuke Shinyama, who is at NYU, has packaged these capabilities as two Python programs: pdf2txt and dumppdf. PDFMiner also supports multi-byte languages with an additional map file. The programs let you find the precise location of text within a PDF, which is a major advantage over many other PDF tools you'll find.
PDFMiner even has a cool PDF to HTML conversion demo page that shows what kinds of things you can do with it, although some of the PDFs we tried did not display with true fidelity. Nonetheless, the tools provide insight into the kinds of things you may be able to do with PDF files in your custom pipeline stage, in OpenPipeline, or just for poking around.
The Subversion repository is maintained at Google's code.google.com, and you can download the PDFMiner source from www.unixuser.com.
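To give a feel for what "finding the precise location of text" can look like in practice, here is a minimal sketch. It assumes the maintained fork, pdfminer.six, and its extract_pages function rather than the original release described above, whose API may differ. The helper names collect_text_boxes and text_boxes_from_pdf are our own, invented for illustration.

```python
from typing import Iterable, Iterator, Tuple

BBox = Tuple[float, float, float, float]


def collect_text_boxes(pages: Iterable) -> Iterator[Tuple[int, str, BBox]]:
    """Yield (page_number, text, bbox) for each non-empty text element.

    Each page is expected to be iterable, and text elements are expected to
    expose get_text() and a bbox of (x0, y0, x1, y1), as pdfminer.six layout
    objects do. Elements without get_text() (lines, images) are skipped.
    """
    for page_number, page in enumerate(pages, start=1):
        for element in page:
            if hasattr(element, "get_text"):
                text = element.get_text().strip()
                if text:
                    yield page_number, text, tuple(element.bbox)


def text_boxes_from_pdf(path: str):
    """Hypothetical convenience wrapper; requires pdfminer.six to be installed."""
    # Imported lazily so collect_text_boxes works without the library.
    from pdfminer.high_level import extract_pages  # assumption: pdfminer.six
    return list(collect_text_boxes(extract_pages(path)))
```

Calling text_boxes_from_pdf("manual.pdf") (a placeholder file name) would return one tuple per text block, with coordinates measured in PDF points from the bottom-left corner of the page. A custom pipeline stage could use those coordinates to tell headings from body text or to highlight hits in a viewer.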
Raritan Technologies offers a number of connectors that interface with multiple existing search engines and give the user a combined results list. They even support Z39.50.
When companies would like to upgrade or replace their search engine, who should they call? It's very easy to predict what each vendor will advise.
From France comes a nice, fast XML parser (website in English, of course!). This might be handy if you are writing an XML search application.
This program converts between various audio and video file formats. This might be useful for search engines trying to do audio mining (speech to searchable text).
Microsoft and Open Source, that's right!
With this tool you can allow your website visitors to view binary files like MS Word and PDF on your site without them needing to download and launch a viewer. This could be handy, for example, if your documentation is in PDF but people complain about needing to launch the Adobe viewer.