This file can be generated from a tagged corpus using the script src/utilities/train-tagger/bin/TRAIN.sh provided in the FreeLing package. See src/utilities/train-tagger/README to find out how to use it.
The probabilities file has nine sections: <TagsetFile>, <UnknownTags>, <Theeta>, <Suffixes>, <SingleTagFreq>, <ClassTagFreq>, <FormTagFreq>, <BiassSuffixes>, <LidstoneLambda>.
Each section is closed by its corresponding tag: </TagsetFile>, </UnknownTags>, </Theeta>, </Suffixes>, </SingleTagFreq>, </ClassTagFreq>, </FormTagFreq>, </BiassSuffixes>, </LidstoneLambda>.
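The section layout described above can be read with a few lines of Python. The following is only a minimal sketch, not the loader FreeLing itself uses: the function name read_sections is hypothetical, and it assumes that every opening and closing tag sits alone on its own line and that sections are not nested.

```python
import re

def read_sections(text):
    """Split the file contents into {section_name: [non-empty lines]}.

    Assumption: each opening/closing tag is alone on its line and
    sections are never nested.
    """
    sections = {}
    current = None  # name of the section currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        opening = re.fullmatch(r"<(\w+)>", line)
        closing = re.fullmatch(r"</(\w+)>", line)
        if opening and current is None:
            current = opening.group(1)
            sections[current] = []
        elif closing:
            if closing.group(1) != current:
                raise ValueError(f"unexpected closing tag: {line}")
            current = None
        elif current is not None:
            sections[current].append(line)
    if current is not None:
        raise ValueError(f"section <{current}> is never closed")
    return sections

# Tiny sample built from the examples given in this section.
sample = """\
<Theeta>
0.00834
</Theeta>
<SingleTagFreq>
AQ 7462
</SingleTagFreq>
"""
sections = read_sections(sample)
print(sections)
```

Keeping the lines of each section as raw strings leaves the per-section parsing (counts, real numbers, paths) to separate helpers.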
<TagsetFile>. This section contains a single
line with the path to a tagset description file (see section
4.1) to be used when computing short versions for PoS
tags. If the path is relative, the location of the lexical
probabilities file is used as the base directory.
<FormTagFreq>. Probability data of some high frequency forms.
If the word is found in this list, lexical probabilities are
computed using data in the <FormTagFreq> section.
The list consists of one form per line, each line with format:
form ambiguity-class tag1 #observ1 tag2 #observ2 ...
E.g. japonesas AQ-NC AQ 1 NC 0
Form probabilities are smoothed to avoid zero-probabilities.
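As an illustration of the line format, the sketch below splits a <FormTagFreq> line into its parts. The function name parse_form_line is hypothetical, and whitespace-separated fields are assumed.

```python
def parse_form_line(line):
    """Parse 'form ambiguity-class tag1 #observ1 tag2 #observ2 ...'.

    Returns (form, ambiguity_class, {tag: observations}).
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"too few fields: {line!r}")
    form = fields[0]
    # Tolerate an optional trailing comma after the ambiguity class.
    amb_class = fields[1].rstrip(",")
    rest = fields[2:]
    if len(rest) % 2 != 0:
        raise ValueError(f"tag without a count: {line!r}")
    counts = {tag: int(n) for tag, n in zip(rest[0::2], rest[1::2])}
    return form, amb_class, counts

form, amb_class, counts = parse_form_line("japonesas AQ-NC AQ 1 NC 0")
print(form, amb_class, counts)
```

The NC count of 0 in the example is the kind of entry that would yield a zero probability without smoothing.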
<ClassTagFreq>. Probability data of ambiguity classes.
If the word is not found in the <FormTagFreq> section, frequencies
for its ambiguity class are used.
The list consists of one class per line, each line with format:
class tag1 #observ1 tag2 #observ2 ...
E.g. AQ-NC AQ 2361 NC 2077
Class probabilities are smoothed to avoid zero-probabilities.
<SingleTagFreq>. Unigram probabilities.
If the ambiguity class is not found in the <ClassTagFreq> section, individual
frequencies for its possible tags are used.
One tag per line, each line with format: tag #observ
E.g. AQ 7462
Tag probabilities are smoothed to avoid zero-probabilities.
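The three sections above form a back-off chain: form, then ambiguity class, then individual tags. A minimal sketch of that lookup order follows. The function name select_counts is hypothetical, and building the class name by joining the sorted tags with a hyphen (as in AQ-NC) is an assumption made for this example, not a documented rule.

```python
def select_counts(form, possible_tags, form_freq, class_freq, tag_freq):
    """Pick the counts to use for a word, backing off
    form -> ambiguity class -> single tags.

    Returns (source, {tag: observations}).
    """
    if form in form_freq:
        return "form", form_freq[form]
    # Assumption: the class name is the sorted tags joined by '-'.
    amb_class = "-".join(sorted(possible_tags))
    if amb_class in class_freq:
        return "class", class_freq[amb_class]
    # Last resort: unigram counts of each possible tag (0 if unlisted).
    return "tag", {t: tag_freq.get(t, 0) for t in possible_tags}

# Data taken from the examples in this section.
form_freq = {"japonesas": {"AQ": 1, "NC": 0}}
class_freq = {"AQ-NC": {"AQ": 2361, "NC": 2077}}
tag_freq = {"AQ": 7462}

print(select_counts("japonesas", ["AQ", "NC"], form_freq, class_freq, tag_freq))
print(select_counts("unseen", ["NC", "AQ"], form_freq, class_freq, tag_freq))
print(select_counts("unseen", ["AQ"], form_freq, class_freq, tag_freq))
```

The returned counts are raw observations; they still have to be smoothed and normalized before being used as probabilities.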
<Theeta>. Value for parameter theeta
used in smoothing of tag probabilities based on word suffixes.
If the word is not found in the dictionary (and so the list of its
possible tags is unknown), the distribution is computed using the
data in the <Theeta>, <Suffixes>, and <UnknownTags> sections.
The section has exactly one line, with one real number.
E.g.
<Theeta>
0.00834
</Theeta>
<BiassSuffixes>. Weighted interpolation factor between
class probability and word suffixes.
The section has exactly one line, with one real number.
E.g.
<BiassSuffixes>
0.4
</BiassSuffixes>
Default value is 0.3.
The probability of the tags belonging to words unobserved in the training corpus is computed by backing off to the distribution of all words with the same ambiguity class. This obviously overgeneralizes, and for some words the estimated probabilities may be rather far from reality.
To palliate this overgeneralization, the ambiguity class probabilities can be interpolated with the probabilities assigned by the guesser according to the word suffix.
This parameter specifies the weight that suffix information is given in the interpolation,
i.e. if BiassSuffixes=0, only the ambiguity class information is used.
If BiassSuffixes=1, only the probabilities provided by the guesser are used.
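The behaviour at the two extremes is consistent with a linear interpolation, which can be sketched as follows. This is an assumption about the exact formula rather than a statement of how FreeLing computes it. The function name interpolate is hypothetical, and the probability values in the example are invented purely for illustration.

```python
def interpolate(p_class, p_suffix, bias):
    """Mix ambiguity-class and suffix-guesser distributions.

    Assumed formula: (1 - bias) * p_class + bias * p_suffix,
    so bias=0 keeps only the class data and bias=1 only the guesser.
    """
    if not 0.0 <= bias <= 1.0:
        raise ValueError("bias must be between 0 and 1")
    tags = set(p_class) | set(p_suffix)
    return {
        t: (1.0 - bias) * p_class.get(t, 0.0) + bias * p_suffix.get(t, 0.0)
        for t in tags
    }

# Illustrative (made-up) distributions for a word of class AQ-NC.
p_class = {"AQ": 0.5, "NC": 0.5}
p_suffix = {"AQ": 0.25, "NC": 0.75}

print(interpolate(p_class, p_suffix, 0.0))  # class information only
print(interpolate(p_class, p_suffix, 1.0))  # guesser information only
print(interpolate(p_class, p_suffix, 0.3))  # default weight
```

Since both inputs are distributions that sum to one, any weight between 0 and 1 also yields a distribution that sums to one.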
<Suffixes>. List of suffixes obtained from a
training corpus, with information about which tags were assigned to
the words with that suffix.
The list has one suffix per line, each line with format: suffix #observ tag1 #observ1 tag2 #observ2 ...
E.g.
orada 133 AQ0FSP 17 VMP00SF 8 NCFS000 108
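A line of this section can be split in the same way as the other count lists. The sketch below uses a hypothetical function name, parse_suffix_line, and assumes whitespace-separated fields. In the example above the per-tag counts (17 + 8 + 108) add up to the suffix total of 133, which the code checks for that line only. Whether this holds for every line of a real file is not stated here.

```python
def parse_suffix_line(line):
    """Parse 'suffix #observ tag1 #observ1 tag2 #observ2 ...'.

    Returns (suffix, total_observations, {tag: observations}).
    """
    fields = line.split()
    if len(fields) < 4 or len(fields) % 2 != 0:
        raise ValueError(f"malformed suffix line: {line!r}")
    suffix, total = fields[0], int(fields[1])
    rest = fields[2:]
    counts = {tag: int(n) for tag, n in zip(rest[0::2], rest[1::2])}
    return suffix, total, counts

suffix, total, counts = parse_suffix_line(
    "orada 133 AQ0FSP 17 VMP00SF 8 NCFS000 108"
)
print(suffix, total, counts)
# For this example line the tag counts sum to the suffix total.
print(sum(counts.values()) == total)
```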
<UnknownTags>. List of open-category tags to
consider as possible candidates for any unknown word.
One tag per line, each line with format: tag #observ. The tag is the complete label. The count is the number of occurrences in a training corpus.
E.g. NCMS000 33438
<LidstoneLambda>. Value for the lambda parameter used in Lidstone smoothing.
The section has exactly one line, with one real number.
E.g.
<LidstoneLambda>
0.2
</LidstoneLambda>
Default value is 0.1.
This parameter is used only to smooth the lexical probabilities of
words that have appeared in the training corpus, and thus are listed
in the <FormTagFreq> section described above.
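For reference, the textbook form of Lidstone smoothing adds lambda to every count before normalizing. The sketch below applies it to the <FormTagFreq> example. It assumes that the number of categories is the number of tags listed for the form, which may differ from what FreeLing actually does, and the function name lidstone is hypothetical.

```python
def lidstone(counts, lam):
    """Textbook Lidstone smoothing: (count + lam) / (total + lam * K).

    Assumption: K is the number of tags listed for the form.
    """
    total = sum(counts.values())
    denom = total + lam * len(counts)
    if denom <= 0:
        raise ValueError("nothing to normalize: no counts and lam <= 0")
    return {tag: (c + lam) / denom for tag, c in counts.items()}

# Counts from the example line 'japonesas AQ-NC AQ 1 NC 0',
# smoothed with the default lambda of 0.1.
probs = lidstone({"AQ": 1, "NC": 0}, 0.1)
print({tag: round(p, 4) for tag, p in probs.items()})
```

With these counts, AQ gets 1.1/1.2 (about 0.9167) and NC gets 0.1/1.2 (about 0.0833), so the unobserved tag keeps a small non-zero probability.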
Lluís Padró 2013-09-09