This module is somewhat different from the other modules, since it doesn't enrich the given text. It compares the given text with the available models for different languages, and returns the most likely language the text is written in. It can be used as a preprocessing step to determine which data files should be used to analyze the text.
The API of the language identifier is the following:
class lang_ident {
  public:
    /// Build an empty language identifier.
    lang_ident();

    /// Build a language identifier, read options from given file.
    lang_ident(const std::wstring &);

    /// load given language from given model file, add to existing languages.
    void add_language(const std::wstring&);

    /// train a model for a language, store in modelFile, and add
    /// it to the known languages list.
    void train_language(const std::wstring &, const std::wstring &, const std::wstring &);

    /// Classify the input text and return the code of the best language (or "none")
    std::wstring identify_language(const std::wstring&,
                                   const std::set<std::wstring> &ls=std::set<std::wstring>()) const;

    /// fill a vector with sorted probabilities for each language
    void rank_languages(std::vector<std::pair<double,std::wstring> > &,
                        const std::wstring &,
                        const std::set<std::wstring> &ls=std::set<std::wstring>()) const;
};
Once created, the language identifier may be used to get the most
likely language of a text (identify_language) or to return a
sorted vector of probabilities for each language (rank_languages).
In both cases, a set of languages to be considered may be supplied,
telling the identifier to apply only the models for those languages
to the input text. An empty list is interpreted as "use all available
language models".
The language list parameter is optional in both identification methods,
and defaults to the empty list.
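The effect of the optional language set can be sketched with a small stand-in class. This is not the library's lang_ident: toy_lang_ident, its two-argument add_language, make_demo_identifier, and the word-list "models" are all hypothetical, chosen only so the example compiles on its own. The best-first ordering of the ranking and the rule for returning "none" are likewise assumptions of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for illustration only: NOT the library's lang_ident.
// Each "model" is a set of frequent words; a language's score is the
// fraction of input tokens found in its set.
class toy_lang_ident {
 public:
  // Hypothetical signature: registers a language code with a word list.
  void add_language(const std::wstring &code, const std::set<std::wstring> &words) {
    models_[code] = words;
  }

  // Fill 'result' with (score, language) pairs, best first (assumed order).
  // An empty 'ls' means "use every known language".
  void rank_languages(std::vector<std::pair<double, std::wstring> > &result,
                      const std::wstring &text,
                      const std::set<std::wstring> &ls = std::set<std::wstring>()) const {
    result.clear();
    std::vector<std::wstring> tokens;
    std::wistringstream in(text);
    for (std::wstring tok; in >> tok;) tokens.push_back(tok);

    for (const auto &m : models_) {
      if (!ls.empty() && ls.count(m.first) == 0) continue;  // language not requested
      std::size_t hits = 0;
      for (const auto &t : tokens) hits += m.second.count(t);
      const double score = tokens.empty() ? 0.0 : static_cast<double>(hits) / tokens.size();
      result.push_back(std::make_pair(score, m.first));
    }
    std::sort(result.begin(), result.end(),
              std::greater<std::pair<double, std::wstring> >());
  }

  // Return the best-scoring language code, or "none" when nothing matches.
  std::wstring identify_language(const std::wstring &text,
                                 const std::set<std::wstring> &ls = std::set<std::wstring>()) const {
    std::vector<std::pair<double, std::wstring> > ranking;
    rank_languages(ranking, text, ls);
    if (ranking.empty() || ranking.front().first <= 0.0) return L"none";
    return ranking.front().second;
  }

 private:
  std::map<std::wstring, std::set<std::wstring> > models_;
};

// Hypothetical helper: builds an identifier with three tiny word lists.
toy_lang_ident make_demo_identifier() {
  toy_lang_ident di;
  di.add_language(L"en", {L"the", L"is", L"and", L"of"});
  di.add_language(L"es", {L"el", L"es", L"y", L"de", L"la"});
  di.add_language(L"ca", {L"el", L"i", L"de", L"amb", L"la"});
  return di;
}
```

With this stand-in, calling identify_language on an English sentence without a language set considers all three models, while passing a set containing only "es" and "ca" keeps the English model out of the comparison, so the same sentence is reported as "none".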
The same lang_ident class may be used to train models for new
languages. The method train_language uses a plain text file to
create a new model, which is added to the identifier's language
repertoire and stored for use in future instances of the class.
The constructor expects the name of a configuration file, which specifies where the language models are located and sets some parameters. The contents of that file are described below.