Tag Set Managing Module

This module stores information about a tagset and offers utility functions for PoS tags and their morphological features.

This module is used internally by several analyzers (e.g. the probabilities module, the HMM tagger, feature extraction, ...), but any user application that needs it can also instantiate and call it.

The API of the module is:

class tagset {
 
  public:
    /// constructor: load a tag set description file
    tagset(const std::wstring &f);
    /// destructor
    ~tagset();

    /// get short version of given tag
    std::wstring get_short_tag(const std::wstring &tag) const;

    /// get list of <feature,value> pairs with morphological
    /// information for given tag
    std::list<std::pair<std::wstring,std::wstring> >
              get_msf_features(const std::wstring &tag) const;

    /// get list of <feature,value> pairs with morphological
    /// information, in a string format
    std::wstring get_msf_string(const std::wstring &tag) const;
};

The class constructor receives the name of a tagset description file. The format of this file is described below. The class offers two services:

  1. Get the short version of a tag. This is useful for EAGLES tagsets and is required by some modules (e.g. the PoS tagger). The length of a short tag depends on the language and the part of speech. The usual criterion is that the short tag be informative enough to capture relevant features (such as category, subcategory, case, etc.) yet general enough that significant statistics for PoS tagging can be acquired from reasonably sized corpora.
  2. Decompose a tag into a list of feature-value pairs (e.g. gender=masc, num=plural, case=dative, etc.). The result can be retrieved as a list of string pairs or as a formatted string.
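
The two services can be illustrated with a small self-contained sketch. It is not the module's implementation: toy_tagset is a hypothetical stand-in that mirrors the three query methods of the API above, but hard-codes its rules instead of loading a tagset description file. The noun rules (a two-character short tag, and the positions for type, gen and num), the example tag NCMS000, and the "|" separator in the formatted string are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in mirroring the query methods of the tagset API.
// NOT the real module: rules are hard-coded (hypothetical), not read from a file.
class toy_tagset {
  public:
    using feature_list = std::list<std::pair<std::wstring, std::wstring>>;

    toy_tagset() {
      // Assumed rule: for category 'N' (noun) the short tag keeps 2 characters.
      short_len_[L'N'] = 2;
      // Assumed positional decomposition for 'N': the characters after the
      // category encode type, gender and number, in that order.
      positions_[L'N'] = {
        {L"type", {{L'C', L"common"}, {L'P', L"proper"}}},
        {L"gen",  {{L'M', L"masc"},   {L'F', L"fem"}}},
        {L"num",  {{L'S', L"singular"}, {L'P', L"plural"}}}
      };
    }

    // Service 1: short version of a tag (length depends on the category).
    std::wstring get_short_tag(const std::wstring &tag) const {
      if (tag.empty()) return tag;
      auto it = short_len_.find(tag[0]);
      if (it == short_len_.end()) return tag;  // unknown category: keep whole tag
      return tag.substr(0, it->second);
    }

    // Service 2a: decompose a tag into <feature,value> pairs.
    feature_list get_msf_features(const std::wstring &tag) const {
      feature_list feats;
      if (tag.empty()) return feats;
      auto it = positions_.find(tag[0]);
      if (it == positions_.end()) return feats;
      const auto &rules = it->second;
      for (std::size_t i = 0; i < rules.size() && i + 1 < tag.size(); ++i) {
        auto v = rules[i].values.find(tag[i + 1]);
        if (v != rules[i].values.end())  // skip codes with no value, e.g. '0'
          feats.emplace_back(rules[i].feature, v->second);
      }
      return feats;
    }

    // Service 2b: the same pairs as one string; the "|" separator is an assumption.
    std::wstring get_msf_string(const std::wstring &tag) const {
      std::wstring out;
      for (const auto &fv : get_msf_features(tag)) {
        if (!out.empty()) out += L"|";
        out += fv.first + L"=" + fv.second;
      }
      return out;
    }

  private:
    struct position_rule {
      std::wstring feature;                    // feature name, e.g. "gen"
      std::map<wchar_t, std::wstring> values;  // tag character -> feature value
    };
    std::map<wchar_t, std::size_t> short_len_;
    std::map<wchar_t, std::vector<position_rule>> positions_;
};

// Usage sketch: query the toy rules with a hypothetical noun tag.
inline void check_toy_tagset() {
  toy_tagset ts;
  assert(ts.get_short_tag(L"NCMS000") == L"NC");
  assert(ts.get_msf_string(L"NCMS000") == L"type=common|gen=masc|num=singular");
}
```

With the real module, the same three calls would be made on a tagset object built from a description file, and both the short-tag lengths and the feature names would come from that file rather than from hard-coded tables.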



Lluís Padró 2013-09-09