Language Identifier Module

This module is somewhat different from the other modules, since it does not enrich the given text. Instead, it compares the text with the available models for different languages and returns the language in which the text is most likely written. It can be used as a preprocessing step to determine which data files should be used to analyze the text.

The API of the language identifier is as follows:

class lang_ident {
  public:
    /// Build an empty language identifier.
    lang_ident();
    /// Build a language identifier, read options from given file.
    lang_ident(const std::wstring &);
    /// Load given language from given model file, add to existing languages.
    void add_language(const std::wstring&);
    /// Train a model for a language, store in modelFile, and add
    /// it to the known languages list.
    void train_language(const std::wstring &, const std::wstring &, 
                        const std::wstring &);
    /// Classify the input text and return the code of the best language (or "none")
    std::wstring identify_language (
                    const std::wstring&, 
                    const std::set<std::wstring> &ls=std::set<std::wstring>()) const; 
    /// Fill a vector with sorted probabilities for each language.
    void rank_languages (
               std::vector<std::pair<double,std::wstring> > &, 
               const std::wstring &,
               const std::set<std::wstring> &ls=std::set<std::wstring>()) const;
};

Once created, the language identifier can be used to get the most likely language of a text (identify_language) or to obtain a sorted vector of probabilities for each language (rank_languages). In both cases, a set of languages to be considered may be supplied, which tells the identifier to apply only the models for those languages to the input text. This parameter is optional in both identification methods and defaults to the empty set, which is interpreted as "use all available language models".
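
The effect of the optional language set can be illustrated with a small stand-in class. This is only a sketch: toy_lang_ident, its two-argument add_language, the word-list "models", and the word-overlap score are all hypothetical and were invented for this example; the real lang_ident loads its models from files and computes its own scores. Only the shapes of identify_language and rank_languages follow the listing above. The sketch assumes a C++11 compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for lang_ident. Each "model" is just a set of
// frequent words, NOT the kind of model the real module loads from disk.
class toy_lang_ident {
  public:
    // Hypothetical: the real add_language takes a model file name instead.
    void add_language(const std::wstring &code,
                      const std::set<std::wstring> &words) {
      models[code] = words;
    }

    // Fill 'result' with (score, language) pairs, highest score first.
    // An empty 'ls' means "use all available language models".
    void rank_languages(
           std::vector<std::pair<double,std::wstring> > &result,
           const std::wstring &text,
           const std::set<std::wstring> &ls=std::set<std::wstring>()) const {
      result.clear();
      for (const auto &m : models) {
        if (!ls.empty() && ls.count(m.first) == 0) continue;  // not requested
        result.push_back(std::make_pair(score(text, m.second), m.first));
      }
      std::sort(result.begin(), result.end(),
                std::greater<std::pair<double,std::wstring> >());
    }

    // Return the code of the best language, or "none" if nothing matched
    // (the "nothing matched" rule is this toy's own choice).
    std::wstring identify_language(
           const std::wstring &text,
           const std::set<std::wstring> &ls=std::set<std::wstring>()) const {
      std::vector<std::pair<double,std::wstring> > ranking;
      rank_languages(ranking, text, ls);
      if (ranking.empty() || ranking.front().first <= 0.0) return L"none";
      return ranking.front().second;
    }

  private:
    std::map<std::wstring, std::set<std::wstring> > models;

    // Toy score: fraction of whitespace-separated tokens found in 'words'.
    static double score(const std::wstring &text,
                        const std::set<std::wstring> &words) {
      std::wistringstream in(text);
      std::wstring tok;
      double hits = 0, total = 0;
      while (in >> tok) {
        total += 1;
        if (words.count(tok) > 0) hits += 1;
      }
      return total > 0 ? hits / total : 0.0;
    }
};

// Build an identifier with three tiny, made-up word lists.
toy_lang_ident make_toy_identifier() {
  toy_lang_ident id;
  id.add_language(L"en", {L"the", L"is", L"and", L"of"});
  id.add_language(L"es", {L"el", L"es", L"y", L"de", L"la"});
  id.add_language(L"ca", {L"el", L"i", L"de", L"la", L"amb"});
  return id;
}

// Usage: the same text, with and without a restricted language set.
void demo_candidate_sets() {
  toy_lang_ident id = make_toy_identifier();
  const std::wstring text = L"la casa de la playa y el mar";

  // Default (empty set): every loaded model is considered.
  assert(id.identify_language(text) == L"es");

  // Restricted set: only the "en" and "ca" models are applied.
  std::set<std::wstring> candidates;
  candidates.insert(L"en");
  candidates.insert(L"ca");
  assert(id.identify_language(text, candidates) == L"ca");

  // rank_languages returns one entry per considered language.
  std::vector<std::pair<double,std::wstring> > ranking;
  id.rank_languages(ranking, text, candidates);
  assert(ranking.size() == 2);
}
```

Restricting the set is useful when the application already knows the text can only be in a handful of languages, since models outside the set cannot win by accident.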

The same lang_ident class can also be used to train models for new languages. The method train_language builds a new model from a plain text file, adds it to the identifier's list of known languages, and stores it in a model file so that future instances of the class can load it.
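
The train, store, and reload cycle described above can be sketched as follows. Everything here is an assumption made for illustration: the character-trigram counts, the toy_model type, the one-entry-per-line storage format, and the function names are invented for this example and do not describe the model format or algorithm that lang_ident actually uses. String streams stand in for the text and model files so that the sketch is self-contained. The three string arguments of train_language are not named in the listing above, so the sketch does not try to mirror them.

```cpp
#include <cassert>
#include <cwctype>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical model: how often each character trigram occurs in the text.
typedef std::map<std::wstring, long> toy_model;

// "Train": count every 3-character window of the plain text.
// Any whitespace character is normalized to a single space first.
toy_model train_toy_model(const std::wstring &text) {
  std::wstring clean(text);
  for (auto &c : clean)
    if (std::iswspace(c)) c = L' ';

  toy_model model;
  for (std::size_t i = 0; i + 3 <= clean.size(); ++i)
    ++model[clean.substr(i, 3)];
  return model;
}

// "Store": one "<count> <trigram>" entry per line (made-up format).
void save_toy_model(const toy_model &model, std::wostream &out) {
  for (const auto &entry : model)
    out << entry.second << L' ' << entry.first << L'\n';
}

// "Load": read the entries back, e.g. in a later run of the program.
toy_model load_toy_model(std::wistream &in) {
  toy_model model;
  long count;
  std::wstring gram;
  while (in >> count) {
    in.get();                // skip the single separator space
    std::getline(in, gram);  // the trigram itself may contain spaces
    model[gram] = count;
  }
  return model;
}

// Usage: train once, store, and reload without retraining.
void demo_train_and_reload() {
  toy_model trained = train_toy_model(L"the theme");
  assert(trained[L"the"] == 2);   // "the" occurs twice in "the theme"

  std::wstringstream model_file;  // stands in for a model file on disk
  save_toy_model(trained, model_file);

  toy_model reloaded = load_toy_model(model_file);
  assert(reloaded == trained);
}
```

The point of storing the model is the last step: a later instance only needs the stored file, not the original training text.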

The constructor expects the name of a configuration file, which specifies where the language models are located and sets some parameters. The contents of that file are described below.



Lluís Padró 2013-09-09