Named Entity Recognition Module

Two different modules can perform named entity (NE) recognition. They can be instantiated directly, or through a wrapper that creates the appropriate module according to the configuration file.

The API for the wrapper is the following:

class WINDLL ner {
  public:
    /// Constructor
    ner(const std::wstring &);
    /// Destructor
    ~ner();

    /// analyze given sentence
    void analyze(sentence &) const;
    /// analyze given sentences
    void analyze(std::list<sentence> &) const;
    /// analyze sentence, return analyzed copy
    sentence analyze(const sentence &) const;
    /// analyze sentences, return analyzed copy
    std::list<sentence> analyze(const std::list<sentence> &) const;
};
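
The four analyze overloads come in two flavors: two modify their argument in place, and two leave it untouched and return an analyzed copy. The following is a minimal sketch of how such an overload set can be organized, not the library's actual implementation. The sentence struct and the toy_ner class are hypothetical stand-ins introduced only so that the example compiles without the library.

```cpp
#include <list>
#include <string>

// Hypothetical stand-in for the library's sentence type (illustration only).
struct sentence {
  std::wstring text;
  bool analyzed = false;
};

// Hypothetical recognizer exposing the same four overloads as the wrapper.
class toy_ner {
  public:
    /// analyze given sentence in place (here: just mark it as analyzed)
    void analyze(sentence &s) const { s.analyzed = true; }

    /// analyze given sentences in place
    void analyze(std::list<sentence> &ls) const {
      for (sentence &s : ls) analyze(s);
    }

    /// analyze sentence, return analyzed copy; the argument is not modified
    sentence analyze(const sentence &s) const {
      sentence copy = s;
      analyze(copy);   // copy is a non-const lvalue: in-place overload
      return copy;
    }

    /// analyze sentences, return analyzed copy; the argument is not modified
    std::list<sentence> analyze(const std::list<sentence> &ls) const {
      std::list<sentence> copy = ls;
      analyze(copy);   // non-const list: in-place overload
      return copy;
    }
};
```

Note that when the argument is a non-const lvalue, C++ overload resolution selects the in-place version. The copy-returning version is selected only when the argument is const or a temporary.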

The constructor takes the absolute path of a configuration file, which must specify the desired module type (basic or bio) on a line enclosed by the tags <Type> and </Type>.

The rest of the file must contain the configuration options specific to the selected NER type, described below.
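
To illustrate how a wrapper might decide which module to create, the sketch below reads the module type from configuration text. It assumes the opening tag, the type name, and the closing tag each sit on their own line, for example:

<Type>
basic
</Type>

The function read_ner_type is a hypothetical helper written for this example; it is not part of the library, and the real wrapper may parse its configuration differently.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a library function): scans configuration text for
// the line enclosed by <Type> and </Type> and returns L"basic" or L"bio".
// Assumes each tag and the type name are on separate lines.
std::wstring read_ner_type(std::wistream &cfg) {
  std::wstring line;
  bool in_type = false;
  while (std::getline(cfg, line)) {
    // Trim surrounding whitespace; skip blank lines.
    const auto b = line.find_first_not_of(L" \t\r");
    if (b == std::wstring::npos) continue;
    const auto e = line.find_last_not_of(L" \t\r");
    const std::wstring tok = line.substr(b, e - b + 1);

    if (tok == L"<Type>")  { in_type = true;  continue; }
    if (tok == L"</Type>") { in_type = false; continue; }
    if (in_type) {
      if (tok == L"basic" || tok == L"bio") return tok;
      throw std::invalid_argument("unknown NER type");
    }
    // Lines outside the <Type> section belong to the module-specific options
    // and are ignored here.
  }
  throw std::invalid_argument("no NER type found in configuration");
}
```

A caller could open the configuration file with std::wifstream, pass the stream to this helper, and construct the matching module based on the returned string.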

The basic module is simple and fast, and it is easy to adapt to new languages, provided that capitalization is the main clue for NE detection in the target language. It is estimated to recognize about 85% of named entities correctly.
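
To make the role of capitalization concrete, the toy sketch below groups consecutive capitalized tokens into candidate entities. It is only an illustration of the general idea and is not the algorithm used by the basic module: the function capitalized_runs is hypothetical, and it deliberately ignores sentence-initial capitalization and every other complication a real recognizer has to handle.

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Toy heuristic (hypothetical, not the library's algorithm): returns each
// maximal run of consecutive capitalized tokens as one candidate entity,
// with the tokens of a run joined by a single space.
std::vector<std::wstring> capitalized_runs(const std::vector<std::wstring> &tokens) {
  std::vector<std::wstring> entities;
  std::wstring current;
  for (const std::wstring &tok : tokens) {
    const bool cap =
        !tok.empty() && std::iswupper(static_cast<std::wint_t>(tok[0])) != 0;
    if (cap) {
      // Extend the run in progress, or start a new one.
      if (!current.empty()) current += L' ';
      current += tok;
    } else if (!current.empty()) {
      // A non-capitalized token closes the run in progress.
      entities.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) entities.push_back(current);
  return entities;
}
```

A heuristic of this kind only works for languages where proper names are capitalized and common nouns are not, which is why the adaptability of the basic module depends on that property.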

The bio module is based on machine learning algorithms. It achieves higher precision (over 90%), but it is considerably slower than basic, and adapting it to new languages requires a training corpus plus some feature engineering.

Lluís Padró 2013-09-09