In Natural Language Processing, search indexing, and document processing, you often need to identify the language a document is written in before you can do any meaningful processing. Cedric Champeau has posted a handy Java tool on his blog, JLangDetect, that identifies the language of a text using n-gram tokenization. From his page:
JLangDetect is based on n-gram tokenization. Basically, texts are tokenized with different token sizes. For example, given the text "cat", n-gram tokenization for 1 to 3 token sizes will produce the following tokens:
- c
- a
- t
- ca
- at
- cat
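To make the example concrete, here is a minimal sketch of character n-gram tokenization in Java. It is not taken from JLangDetect; the class name `NGramTokenizer` and the method `tokenize` are made up for illustration, and the real library may tokenize differently (for instance, in how it handles case or whitespace).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JLangDetect.
final class NGramTokenizer {

    // Returns every substring of length minSize..maxSize,
    // shortest sizes first, left to right within each size.
    static List<String> tokenize(String text, int minSize, int maxSize) {
        List<String> tokens = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                tokens.add(text.substring(i, i + n));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [c, a, t, ca, at, cat]
        System.out.println(tokenize("cat", 1, 3));
    }
}
```

For "cat" with sizes 1 to 3 this yields the same six tokens listed above.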
The idea is to tokenize a large set of documents in a given language and record token statistics. When you need to identify a language, then you'll tokenize it the same way, and you'll be able to score the input string against several token stats.
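The quote does not say how the scoring is done, so the following is only a rough sketch of one common approach, under my own assumptions: each language profile stores the relative frequency of every n-gram seen in its training text, and an input is scored by summing the log-frequencies of its n-grams, with a small floor value for n-grams the profile has never seen. The class `NGramLanguageGuesser`, its methods, and the `UNSEEN` constant are all hypothetical and are not JLangDetect's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not JLangDetect's implementation.
final class NGramLanguageGuesser {

    // Assumed floor frequency for n-grams missing from a profile.
    private static final double UNSEEN = 1e-6;

    // language -> (n-gram -> relative frequency in the training text)
    private final Map<String, Map<String, Double>> profiles = new HashMap<>();

    // Character n-grams of sizes 1 to 3, lower-cased.
    private static List<String> ngrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                tokens.add(s.substring(i, i + n));
            }
        }
        return tokens;
    }

    // Records token statistics for one language.
    void train(String language, String corpus) {
        List<String> tokens = ngrams(corpus);
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) tokens.size());
        }
        profiles.put(language, freq);
    }

    // Sum of log-frequencies; higher means a better match.
    double score(String language, String text) {
        Map<String, Double> freq = profiles.get(language);
        double total = 0.0;
        for (String t : ngrams(text)) {
            total += Math.log(freq.getOrDefault(t, UNSEEN));
        }
        return total;
    }

    // Returns the trained language with the highest score, or null if none.
    String detect(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String language : profiles.keySet()) {
            double s = score(language, text);
            if (s > bestScore) {
                bestScore = s;
                best = language;
            }
        }
        return best;
    }
}
```

In use, you would call `train` once per language with a large body of text (the sentences you would pass in a quick test are far too small for real work), then call `detect` on the string you want to identify. A real corpus such as Europarl gives much more reliable statistics than a toy example.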
The tool is licensed under the Apache 2.0 license. Links to download the tools are provided on his page. For your convenience:
- Binary: jlangdetect-0.1.jar
- Source: jlangdetect-0.1-sources.jar
- Javadoc: jlangdetect-0.1-javadoc.jar
- Europarl pre-compiled corpus: ngrams-europarl.zip