Part-of-Speech Tagger Module
There are two different modules able to perform PoS tagging. The
application should decide which method is to be used, and
instantiate the right class.
The first PoS tagger is the hmm_tagger class, which is a
classical trigram Markovian tagger, following [Bra00].
The second module, named relax_tagger, is a hybrid system
capable of integrating statistical and hand-coded knowledge, following
[Pad98].
The hmm_tagger module is somewhat faster than relax_tagger, but the latter allows you to add manual constraints
to the model. The API of hmm_tagger is the following:
class hmm_tagger: public POS_tagger {
  public:
    /// Constructor
    hmm_tagger(const std::string &, bool, unsigned int, unsigned int kb=1);
    /// analyze given sentence.
    void analyze(sentence &) const;
    /// analyze given sentences.
    void analyze(std::list<sentence> &) const;
    /// return analyzed copy of given sentence
    sentence analyze(const sentence &) const;
    /// return analyzed copy of given sentences
    std::list<sentence> analyze(const std::list<sentence> &) const;
    /// given an analyzed sentence find out probability
    /// of the k-th best sequence
    double SequenceProb_log(const sentence &, int k=0) const;
};
The hmm_tagger constructor receives the following parameters:
- The HMM file, which contains the model parameters. The format
of the file is described below. This file can be generated from a
tagged corpus using the script src/utilities/train-tagger/bin/TRAIN.sh
provided in the FreeLing package. See src/utilities/train-tagger/README
for details.
- A boolean stating whether words that carry retokenization
information (e.g. set by the dictionary or affix handling modules)
must be retokenized (that is, split into two or more words) after
tagging.
- An integer stating whether and when the tagger must select only
one analysis in case of ambiguity. Possible values are:
  - FORCE_NONE (or 0): no selection is forced; words that are still
    ambiguous after tagging remain ambiguous.
  - FORCE_TAGGER (or 1): force selection immediately after tagging,
    and before retokenization.
  - FORCE_RETOK (or 2): force selection after retokenization.
- An integer stating how many best tag sequences the tagger must
try to compute. If not specified, this parameter defaults to 1.
Since a sentence may have fewer possible tag sequences than the given
k value, the results may contain fewer than k sequences.
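The last point can be checked with simple arithmetic: the number of distinct tag sequences for a sentence is at most the product of the number of candidate tags of each word, so a k-best request cannot return more than that. The sketch below is plain C++ and does not use FreeLing; the function names and the per-word tag counts are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Upper bound on the number of distinct tag sequences for a sentence,
// given how many candidate tags each word has (illustrative input).
// Saturates at the largest std::size_t value instead of overflowing.
std::size_t max_tag_sequences(const std::vector<std::size_t> &tags_per_word) {
  const std::size_t limit = std::numeric_limits<std::size_t>::max();
  std::size_t total = 1;
  for (std::size_t n : tags_per_word) {
    assert(n > 0 && "every word needs at least one candidate tag");
    if (total > limit / n) return limit;  // product would overflow
    total *= n;
  }
  return total;
}

// Largest number of sequences a k-best request could return.
std::size_t available_sequences(const std::vector<std::size_t> &tags_per_word,
                                std::size_t k) {
  return std::min(k, max_tag_sequences(tags_per_word));
}
```

For a three-word sentence whose words have 2, 1 and 3 candidate tags, at most 6 sequences exist, so a request with k=10 can yield no more than 6 of them.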
The relax_tagger module can be tuned with hand-written
constraints, but is about twice as slow as hmm_tagger.
Unlike hmm_tagger, it cannot produce the k best sequences. Its API is the following:
class relax_tagger : public POS_tagger {
  public:
    /// Constructor, given the constraint file and config parameters
    relax_tagger(const std::string &, int, double, double, bool, unsigned int);
    /// analyze given sentence.
    void analyze(sentence &) const;
    /// analyze given sentences.
    void analyze(std::list<sentence> &) const;
    /// return analyzed copy of given sentence
    sentence analyze(const sentence &) const;
    /// return analyzed copy of given sentences
    std::list<sentence> analyze(const std::list<sentence> &) const;
};
The relax_tagger constructor receives the following parameters:
- The constraint file. The format of the file is described
below. This file can be generated from a tagged corpus using the
script src/utilities/train-tagger/bin/TRAIN.sh provided in the
FreeLing package. See src/utilities/train-tagger/README for details.
- An integer stating the maximum number of iterations to wait for
convergence before stopping the disambiguation algorithm.
- A real number representing the scale factor of the constraint weights.
- A real number representing the threshold under which any changes
will be considered too small. Used to detect convergence.
- A boolean stating whether words that carry retokenization
information (e.g. set by the dictionary or affix handling modules)
must be retokenized (that is, split into two or more words) after
tagging.
- An integer stating whether and when the tagger must select only
one analysis in case of ambiguity. Possible values are:
  - FORCE_NONE (or 0): no selection is forced; words that are still
    ambiguous after tagging remain ambiguous.
  - FORCE_TAGGER (or 1): force selection immediately after tagging,
    and before retokenization.
  - FORCE_RETOK (or 2): force selection after retokenization.
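To give an intuition for how the three numeric parameters interact, here is a minimal sketch of a generic relaxation-style update loop. It is not FreeLing's implementation: the type Variable, the function relax, the fixed per-label supports, and the reading of the scale factor as a divisor applied to supports are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One variable (word) with a weight per candidate label (tag) and a raw
// support per label. A real tagger would recompute supports from its
// constraints on every iteration; they are held constant here for brevity.
struct Variable {
  std::vector<double> weight;   // current label weights, summing to 1
  std::vector<double> support;  // raw support per label (assumed given)
};

// Runs the update until no weight changes by more than epsilon, or until
// max_iter iterations have been done. Returns the iterations performed.
int relax(std::vector<Variable> &vars, int max_iter, double scale_factor,
          double epsilon) {
  assert(scale_factor > 0.0);
  int iter = 0;
  bool changed = true;
  while (changed && iter < max_iter) {
    changed = false;
    ++iter;
    for (Variable &v : vars) {
      std::vector<double> next(v.weight.size());
      double norm = 0.0;
      for (std::size_t l = 0; l < v.weight.size(); ++l) {
        // Scale the support and clamp it to [-1, 1] so 1 + s is never negative.
        double s = v.support[l] / scale_factor;
        if (s < -1.0) s = -1.0;
        if (s > 1.0) s = 1.0;
        next[l] = v.weight[l] * (1.0 + s);
        norm += next[l];
      }
      for (std::size_t l = 0; l < next.size(); ++l) {
        // Renormalize; keep the old weight if every label was zeroed out.
        next[l] = (norm > 0.0) ? next[l] / norm : v.weight[l];
        if (std::fabs(next[l] - v.weight[l]) > epsilon) changed = true;
      }
      v.weight = next;
    }
  }
  return iter;
}
```

In this sketch a larger scale factor makes each step smaller, a larger threshold stops the loop earlier, and the iteration limit bounds the running time when the weights keep changing.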
The iteration number, scale factor, and threshold parameters are
specific to the relaxation labelling algorithm. Refer to
[Pad98] for details.
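Since the application decides which tagger to instantiate, the choice can be isolated in one small function. The sketch below uses stand-in classes in a hypothetical namespace sketch so that it compiles without FreeLing installed; they mirror only the constructor signatures listed above. The file names, the numeric settings, the name() helper, and make_tagger itself are assumptions for illustration, not part of the library.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in declarations, NOT the real FreeLing classes. Bodies are stubs.
namespace sketch {

struct sentence {};

const unsigned int FORCE_TAGGER = 1;  // value as documented in this section

class POS_tagger {
 public:
  virtual ~POS_tagger() = default;
  virtual void analyze(sentence &) const = 0;
  virtual std::string name() const = 0;  // hypothetical helper for the demo
};

class hmm_tagger : public POS_tagger {
 public:
  hmm_tagger(const std::string &, bool, unsigned int, unsigned int kb = 1) {
    (void)kb;  // unused in the stub
  }
  void analyze(sentence &) const override {}
  std::string name() const override { return "hmm"; }
};

class relax_tagger : public POS_tagger {
 public:
  relax_tagger(const std::string &, int, double, double, bool, unsigned int) {}
  void analyze(sentence &) const override {}
  std::string name() const override { return "relax"; }
};

// Choose the tagger from an application setting. All argument values are
// placeholders, not recommended settings.
std::unique_ptr<POS_tagger> make_tagger(bool want_manual_constraints) {
  if (want_manual_constraints) {
    return std::unique_ptr<POS_tagger>(new relax_tagger(
        "constraints.dat", /*max iterations*/ 500, /*scale factor*/ 670.0,
        /*threshold*/ 0.001, /*retokenize*/ true, FORCE_TAGGER));
  }
  return std::unique_ptr<POS_tagger>(
      new hmm_tagger("tagger.dat", /*retokenize*/ true, FORCE_TAGGER));
}

}  // namespace sketch
```

The rest of the application can then hold a pointer to the common base class and call analyze on it without knowing which tagger was selected.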
Lluís Padró
2013-09-09