Number Detection Module

The number detection module is language dependent: it recognizes numerical expressions (e.g., 1,220.54 or two-hundred sixty-five) and assigns them a normalized value as lemma.

The module is essentially a finite-state automaton that recognizes valid numerical expressions. Since the structure of the automaton and the actions that compute the actual numerical value are different for each language, the automaton is coded in C++ and has to be rewritten for any new language.

For languages that do not have a specific automaton implemented, a generic module is used to recognize number-like expressions that contain numerical digits.
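
To make the idea of a digit-based recognizer concrete, here is a minimal sketch of a finite-state automaton that accepts tokens such as 1,220.54. It is not the module's actual code: the State enum, the looks_like_number function, and the choice of single-character separators are all assumptions made for illustration, and the sketch does not check that grouping marks separate blocks of exactly three digits.

```cpp
#include <cctype>
#include <string>

// Hypothetical states for a digit-based number recognizer (illustration only).
enum class State { Start, Integer, AfterThousand, AfterDecimal, Fraction, Reject };

// Returns true if `token` looks like a number written with digits, using
// `decimal` as the decimal mark and `thousand` as the grouping mark.
bool looks_like_number(const std::string &token, char decimal, char thousand) {
  State st = State::Start;
  for (char c : token) {
    const bool digit = std::isdigit(static_cast<unsigned char>(c)) != 0;
    switch (st) {
      case State::Start:          // a number must begin with a digit
        st = digit ? State::Integer : State::Reject;
        break;
      case State::Integer:        // inside the integer part
        if (digit) st = State::Integer;
        else if (c == thousand) st = State::AfterThousand;
        else if (c == decimal) st = State::AfterDecimal;
        else st = State::Reject;
        break;
      case State::AfterThousand:  // a grouping mark must be followed by a digit
        st = digit ? State::Integer : State::Reject;
        break;
      case State::AfterDecimal:   // only digits are allowed after the decimal mark
      case State::Fraction:
        st = digit ? State::Fraction : State::Reject;
        break;
      case State::Reject:         // no way back from the rejecting state
        return false;
    }
  }
  // Accept only if the token ended inside the integer or the fractional part.
  return st == State::Integer || st == State::Fraction;
}
```

A language-specific automaton would add states and transitions for number words (such as "two-hundred sixty-five"), which is why it cannot be shared across languages.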

The class requires no configuration file when it is instantiated. The API of the class is:

  
class numbers {
  public:
    /// Constructor: receives the language code, and the decimal 
    /// and thousand point symbols
    numbers(const std::string &, const std::string &, const std::string &); 

    /// analyze given sentence.
    void analyze(sentence &) const;
    /// analyze given sentences.
    void analyze(std::list<sentence> &) const;
    /// return analyzed copy of given sentence
    sentence analyze(const sentence &) const;
    /// return analyzed copy of given sentences
    std::list<sentence> analyze(const std::list<sentence> &) const;
};

The constructor expects three parameters, in this order: the language code, the decimal point symbol, and the thousands point symbol.

The last two parameters are needed because in some Latin languages the comma is used as the decimal separator and the dot as the thousands mark, while in languages like English it is the other way round. These parameters specify which character to expect in each role. They will usually be comma and dot, but any character could be used.
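
The effect of the two separator parameters can be illustrated with a small standalone sketch. The normalize_number helper below is hypothetical and is not part of the numbers class; it only assumes that a normalized value drops the thousands marks and uses a dot as the decimal mark, which may differ from the format the module actually produces.

```cpp
#include <string>

// Hypothetical helper (illustration only): rewrites a digit-based number so
// that thousands marks are dropped and the decimal mark becomes '.'.
std::string normalize_number(const std::string &text,
                             const std::string &decimal,
                             const std::string &thousand) {
  std::string out;
  std::size_t i = 0;
  while (i < text.size()) {
    if (!thousand.empty() && text.compare(i, thousand.size(), thousand) == 0) {
      i += thousand.size();   // skip the thousands mark
    } else if (!decimal.empty() && text.compare(i, decimal.size(), decimal) == 0) {
      out += '.';             // canonical decimal mark
      i += decimal.size();
    } else {
      out += text[i++];       // copy any other character unchanged
    }
  }
  return out;
}
```

Under these assumptions, normalize_number("1,220.54", ".", ",") and normalize_number("1.220,54", ",", ".") both yield "1220.54", showing how the same value is recovered from English-style and comma-decimal notation once the separators are declared.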

Lluís Padró 2013-09-09