The machine-learning-based NER module uses a classification algorithm to decide whether each word begins a NE (B), is inside a NE (I), or is outside any NE (O). Then, a simple Viterbi algorithm is applied to guarantee the coherence of the resulting tag sequence.
It can be instantiated via the ner wrapper described above, or directly via its own API:
class bioner: public ner_module {
 public:
   /// Constructor, receives the name of the configuration file.
   bioner(const std::string &);

   /// analyze given sentence.
   void analyze(sentence &) const;

   /// analyze given sentences.
   void analyze(std::list<sentence> &) const;

   /// return analyzed copy of given sentence
   sentence analyze(const sentence &) const;

   /// return analyzed copy of given sentences
   std::list<sentence> analyze(const std::list<sentence> &) const;
};
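As a usage illustration, a minimal sketch (assuming the FreeLing headers are included and the input sentences have already been tokenized, split, and morphologically analyzed by the earlier pipeline modules; the header name and the configuration file name are assumptions, not values fixed by this manual):

#include <list>
#include "freeling.h"   // assumed entry-point header; actual name may vary by version

void tag_entities(std::list<sentence> &ls) {
  // Build the BIO NER module from its configuration file
  // ("myner.dat" is an illustrative name).
  bioner recognizer("myner.dat");

  // Annotate the sentences in place: detected named entities are
  // marked with the PoS tag configured in <NE_Tag>.
  recognizer.analyze(ls);
}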
The configuration file sets the required model and lexicon files, which may be generated from a training corpus using the scripts provided with FreeLing (in folder src/utilities/nerc). Check the README and comments in the scripts to find out what to do.
The most important file in the set is the .rgf file, which contains a definition of the context features that must be extracted for each named entity. The feature rule language is described in section 4.4.
The sections of the configuration file for bioner module are:
<RGF> contains one line with the path to the RGF file of the model. This file is the definition of the features that will be taken into account for NER. These features are processed by libfries.
<RGF> ner.rgf </RGF>
<Classifier> contains one line with the kind of classifier to use. Valid values are AdaBoost and SVM.
<Classifier> AdaBoost </Classifier>
<ModelFile> contains one line with the path to the model file to be used. The model file must match the classifier type given in section <Classifier>.
<ModelFile> ner.abm </ModelFile>

The .abm files contain AdaBoost models based on shallow Decision Trees (see [CMP03] for details). You don't need to understand this, unless you want to look into the code of the AdaBoost classifier.
The .svm files contain Support Vector Machine models generated by libsvm [CL11]. You don't need to understand this, unless you want to look into the code of libsvm.
<Lexicon> contains one line with the path to the lexicon file of the learnt model. The lexicon is used to translate string-encoded features generated by libfries to integer-encoded features needed by libomlet. The lexicon file is generated by libfries at training time.
<Lexicon> ner.lex </Lexicon>

The .lex file is a dictionary that assigns a number to each symbolic feature used in the AdaBoost or SVM model. You don't need to understand this either, unless you are a Machine Learning student or the like.
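Conceptually, the lexicon is just a string-to-integer map. A minimal sketch of the idea (the two-column file layout suggested in the comment is illustrative, not the actual .lex format):

#include <fstream>
#include <map>
#include <string>

// Illustrative only: load a feature lexicon mapping each symbolic
// feature name to the integer code used by the classifier.
std::map<std::string,int> load_lexicon(const std::string &fname) {
  std::map<std::string,int> lex;
  std::ifstream fin(fname.c_str());
  std::string feat; int code;
  while (fin >> feat >> code)   // hypothetical layout: one "feature code" pair per line
    lex[feat] = code;
  return lex;
}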
<UseSoftMax> contains only one line with yes or no, indicating whether the classifier output must be converted to probabilities with the SoftMax function. Currently, AdaBoost models need that conversion and SVM models do not.
<UseSoftMax> yes </UseSoftMax>
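The SoftMax function itself is standard: it turns the raw per-class scores produced by the classifier into a probability distribution. A minimal sketch (not FreeLing's actual code):

#include <cmath>
#include <vector>

// Convert raw classifier scores (e.g. AdaBoost margins) for the
// B/I/O classes into probabilities: p_i = exp(s_i) / sum_j exp(s_j).
std::vector<double> softmax(const std::vector<double> &scores) {
  std::vector<double> p(scores.size());
  double sum = 0.0;
  for (size_t i = 0; i < scores.size(); ++i) {
    p[i] = std::exp(scores[i]);
    sum += p[i];
  }
  for (size_t i = 0; i < p.size(); ++i) p[i] /= sum;
  return p;
}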
<Classes> contains only one line with the classes of the model and their translation to the B, I, and O tags.
<Classes> 0 B 1 I 2 O </Classes>
<NE_Tag> contains only one line with the PoS tag that will be assigned to the recognized entities. If the NE classifier is going to be used later, it will have to be informed of this tag at creation time.
<NE_Tag> NP00000 </NE_Tag>
<InitialProb> contains the probabilities of seeing each class at the beginning of a sentence. These probabilities are necessary for the Viterbi algorithm used to annotate NEs in a sentence.
<InitialProb> B 0.200072 I 0.0 O 0.799928 </InitialProb>
<TransitionProb> contains the transition probabilities from each class to each other class, used by the Viterbi algorithm.
<TransitionProb> B B 0.00829346 B I 0.395481 B O 0.596225 I B 0.0053865 I I 0.479818 I O 0.514795 O B 0.0758838 O I 0.0 O O 0.924116 </TransitionProb>
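For illustration, a minimal Viterbi sketch over the three B/I/O classes, combining per-word class probabilities from the classifier with initial and transition probabilities like those above (function and variable names are illustrative, not FreeLing's actual implementation):

#include <vector>

// Find the most likely B/I/O sequence for one sentence.
// obs[w][c]   = probability of class c for word w (classifier output),
// init[c]     = initial probability of class c (<InitialProb>),
// trans[a][b] = probability of moving from class a to class b
//               (<TransitionProb>). Classes: 0=B, 1=I, 2=O.
std::vector<int> viterbi(const std::vector<std::vector<double> > &obs,
                         const double init[3],
                         const double trans[3][3]) {
  if (obs.empty()) return std::vector<int>();
  size_t n = obs.size();
  std::vector<std::vector<double> > delta(n, std::vector<double>(3));
  std::vector<std::vector<int> > back(n, std::vector<int>(3));

  // Initialization: first word.
  for (int c = 0; c < 3; ++c) delta[0][c] = init[c] * obs[0][c];

  // Recursion: best path probability for each class at each word.
  for (size_t w = 1; w < n; ++w) {
    for (int c = 0; c < 3; ++c) {
      double best = -1.0; int arg = 0;
      for (int p = 0; p < 3; ++p) {
        double v = delta[w-1][p] * trans[p][c];
        if (v > best) { best = v; arg = p; }
      }
      delta[w][c] = best * obs[w][c];
      back[w][c] = arg;
    }
  }

  // Termination and backtracking: recover the best class sequence.
  std::vector<int> tags(n);
  int bestc = 0;
  for (int c = 1; c < 3; ++c)
    if (delta[n-1][c] > delta[n-1][bestc]) bestc = c;
  tags[n-1] = bestc;
  for (size_t w = n - 1; w > 0; --w) tags[w-1] = back[w][tags[w]];
  return tags;
}

Note how the zero entries in the example transition table (e.g. O I 0.0) make it impossible for the best path to place an I tag right after an O, which is precisely how the Viterbi step enforces sequence coherence.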
<TitleLimit> contains only one line with an integer value stating the length beyond which a sentence written entirely in uppercase will be considered a title and not a proper noun. Example:
<TitleLimit> 3 </TitleLimit>
If TitleLimit=0 (the default), title detection is deactivated (i.e., all-uppercase sentences are always marked as named entities).
The idea of this heuristic is that newspaper titles are usually written in uppercase, and tend to have at least two or three words, while named entities written in this way tend to be acronyms (e.g. IBM, DARPA, ...) and usually have at most one or two words.
For instance, if TitleLimit=3, the sentence FREELING ENTERS NASDAQ UNDER CLOSE OBSERVATION OF MARKET ANALYSTS will not be recognized as a named entity, and its words will be analyzed independently. On the other hand, the sentence IBM INC., having fewer than 3 words, will be considered a proper noun.
Obviously this heuristic is not 100% accurate, but in some cases (e.g. if you are analyzing newspapers) it may be preferable to the default behaviour (which is not 100% accurate either).
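A sketch of the heuristic as just described (a hypothetical helper, not FreeLing's actual code):

#include <cctype>
#include <string>
#include <vector>

// True if every alphabetic character in the word is uppercase.
bool all_uppercase(const std::string &w) {
  for (size_t i = 0; i < w.size(); ++i)
    if (std::isalpha((unsigned char)w[i]) && !std::isupper((unsigned char)w[i]))
      return false;
  return true;
}

// Hypothetical illustration of the TitleLimit heuristic: an
// all-uppercase word sequence longer than the limit is treated as a
// title, not as a named entity. limit==0 deactivates the check.
bool looks_like_title(const std::vector<std::string> &words, int limit) {
  if (limit == 0) return false;            // title detection deactivated
  for (size_t i = 0; i < words.size(); ++i)
    if (!all_uppercase(words[i])) return false;
  return (int)words.size() > limit;
}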
<SplitMultiwords> contains only one line with either yes or no. If SplitMultiwords is activated, Named Entities will still be recognized, but they will not be treated as a single unit carrying one Part-of-Speech tag for the whole compound. Instead, each word gets its own Part-of-Speech tag: capitalized words receive the tag indicated in NE_Tag, while the tags of non-capitalized words inside a Named Entity (typically, prepositions and articles) are left untouched.
<SplitMultiwords> no </SplitMultiwords>
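For reference, assembling the example values shown throughout this section yields a complete bioner configuration file sketch:

<RGF> ner.rgf </RGF>
<Classifier> AdaBoost </Classifier>
<ModelFile> ner.abm </ModelFile>
<Lexicon> ner.lex </Lexicon>
<UseSoftMax> yes </UseSoftMax>
<Classes> 0 B 1 I 2 O </Classes>
<NE_Tag> NP00000 </NE_Tag>
<InitialProb> B 0.200072 I 0.0 O 0.799928 </InitialProb>
<TransitionProb> B B 0.00829346 B I 0.395481 B O 0.596225 I B 0.0053865 I I 0.479818 I O 0.514795 O B 0.0758838 O I 0.0 O O 0.924116 </TransitionProb>
<TitleLimit> 3 </TitleLimit>
<SplitMultiwords> no </SplitMultiwords>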