The first NER module is the np class, which is just an FSA that detects sequences of capitalized words, taking into account some function words (e.g. Bank of England) and capitalization at sentence beginnings.
It can be instantiated via the ner wrapper described above, or directly via its own API:
class np: public ner_module, public automat {
  public:
    /// Constructor, receives a configuration file.
    np(const std::string &);

    /// Detect multiwords starting at given sentence position
    bool matching(sentence &, sentence::iterator &) const;
    /// analyze given sentence.
    void analyze(sentence &) const;
    /// analyze given sentences.
    void analyze(std::list<sentence> &) const;
    /// return analyzed copy of given sentence
    sentence analyze(const sentence &) const;
    /// return analyzed copy of given sentences
    std::list<sentence> analyze(const std::list<sentence> &) const;
};
The file that controls the behaviour of the simple NE recognizer consists of the following sections:
<FunctionWords>
lists the function words that can be
embedded inside a proper noun (e.g. prepositions and articles such
as those in ``Banco de España'' or ``Foundation for the Eradication
of Poverty''). For instance:
<FunctionWords>
el
la
los
las
de
del
para
</FunctionWords>
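The role of the function-word list can be illustrated with a minimal, self-contained sketch. It is not FreeLing's actual automaton: the helper names is_capitalized and find_candidates are hypothetical, tokens are assumed to be already split, function words are assumed to be lowercased, and sentence-initial capitalization is not handled here.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: true if the token starts with an ASCII uppercase letter.
bool is_capitalized(const std::string &w) {
  return !w.empty() && std::isupper(static_cast<unsigned char>(w[0]));
}

// Returns [begin, end) token ranges of candidate proper nouns.
// A candidate starts at a capitalized token and may run through function
// words, provided that it ends on a capitalized token.
std::vector<std::pair<std::size_t, std::size_t>>
find_candidates(const std::vector<std::string> &toks,
                const std::set<std::string> &function_words) {
  std::vector<std::pair<std::size_t, std::size_t>> out;
  std::size_t i = 0;
  while (i < toks.size()) {
    if (!is_capitalized(toks[i])) { ++i; continue; }
    std::size_t end = i + 1;   // one past the last capitalized token seen
    std::size_t j = end;
    while (j < toks.size()) {
      if (is_capitalized(toks[j])) {
        end = ++j;             // extend the candidate up to this token
      } else if (function_words.count(toks[j])) {
        ++j;                   // tentatively skip; kept only if a capital follows
      } else {
        break;
      }
    }
    out.emplace_back(i, end);
    i = end;
  }
  return out;
}
```

With the tokens "the Bank of England said" and the function words "of" and "the", this sketch reports the single range covering "Bank of England". A trailing function word (as in "Mary of course") is not absorbed, because the range only grows when another capitalized token is reached.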
<SpecialPunct>
lists the PoS tags (according to the
punctuation tags definition file, section 3.6) after
which a capitalized word may just indicate the beginning of a
sentence or clause, and not necessarily a named entity. Typical cases are
colon, open parenthesis, dot, and hyphen. For instance:
<SpecialPunct>
Fpa
Fp
Fd
Fg
</SpecialPunct>
<NE_Tag>
contains only one line with the PoS tag that
will be assigned to the recognized entities. If the NE classifier is
going to be used later, it will have to be informed of this tag at
creation time.
<NE_Tag>
NP00000
</NE_Tag>
<Ignore>
contains a list of forms (lowercased)
or PoS tags (uppercased) that are not to be considered a named
entity even when they appear capitalized in the middle of a
sentence. For instance, the word ``Spanish'' in the sentence
``He started studying Spanish two years ago'' is not a named
entity. If the words in the list appear with other capitalized
words, they are considered to form a named entity (e.g. ``An
announcement of the Spanish Bank of Commerce was issued
yesterday''). The same distinction applies to the word ``I'' in
the sentences ``whatever you say, I don't believe'' and ``That was the death of Henry I''.
Each word or tag is followed by a 0 or 1
indicating whether
the ignore condition is strict (0: non-strict, 1: strict).
The entries marked as non-strict will have the
behaviour described above. The entries marked as strict will
never be considered named entities or NE parts.
For instance, the following <Ignore>
section states that
the word ``I'' is not to be a proper noun (``whatever you say,
I don't believe'') unless some of its neighbour words are (``That was the death of Henry I''). It also states that any word
with the RB tag, and any of the listed language names, must
never be considered as possible NEs.
<Ignore>
i 0
RB 1
english 1
dutch 1
spanish 1
</Ignore>
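The strict and non-strict conditions can be summarized in a short sketch. This is an illustration of the rule as described in this section, not FreeLing code: the IgnoreTable type and the may_be_ne function are hypothetical, and matching tags by exact equality is an assumption.

```cpp
#include <map>
#include <string>

// Hypothetical model of an <Ignore> section: key -> strict flag
// (false for entries marked 0, true for entries marked 1).
// Keys are lowercased forms or uppercased PoS tags.
using IgnoreTable = std::map<std::string, bool>;

// Decide whether a capitalized word may be a named entity or part of one.
//  - lc_form: lowercased form of the word
//  - tag: PoS tag of the word (exact match is an assumption of this sketch)
//  - has_capitalized_neighbour: true if an adjacent word is also an NE candidate
bool may_be_ne(const IgnoreTable &ignore, const std::string &lc_form,
               const std::string &tag, bool has_capitalized_neighbour) {
  auto it = ignore.find(lc_form);
  if (it == ignore.end()) it = ignore.find(tag);
  if (it == ignore.end()) return true;   // not listed: ordinary candidate
  if (it->second) return false;          // strict: never an NE or NE part
  return has_capitalized_neighbour;      // non-strict: only inside a longer NE
}
```

Loaded with the entries of the example section, this sketch rejects ``I'' when it stands alone, accepts it next to ``Henry'', and rejects ``Spanish'' and any RB-tagged word regardless of their neighbours.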
<Names>
contains a list of lemmas that may be
names, even if they conflict with some of the heuristic criteria
used by the NE recognizer. This is useful when they appear
capitalized at the beginning of a sentence. For instance, the Basque name
Miren (Mary) or the nickname Pelé may appear at the
beginning of a Spanish sentence. Since both of them are verbal
forms in Spanish, they would not be considered candidates to form
named entities.
Including the form in the <Names>
section causes the NE
choice to be added to the possible tags of the form, giving the
tagger the chance to decide whether it is actually a verb or a
proper noun.
<Names>
miren
pelé
zapatero
china
</Names>
<Affixes>
contains a list of words that may be
part of a NE (either prefixing or suffixing it) even if they are
lowercased. For instance, this is the case of the word don
in Spanish (e.g. don_Juan should be a NE, even if don
is lowercased), or the word junior or jr. in English
(e.g. Peter_Grasswick_jr. should be a NE, even if jr. is lowercased).
The section should contain one word per line, followed by the keyword PRE or SUF stating whether the word may be attached before or after an NE. If a word may be both a prefix and a suffix, it must be declared in two different lines, one with each keyword.
<Affixes>
don PRE
doña PRE
jr. SUF
</Affixes>
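As an illustration of the line format only, the following sketch reads the body of an <Affixes> section into two sets. The Affixes struct and the parse_affixes function are hypothetical names, not part of the FreeLing API, and silently skipping malformed lines is an assumption of the sketch.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical container for the parsed section.
struct Affixes {
  std::set<std::string> prefixes;
  std::set<std::string> suffixes;
};

// Parse the body of an <Affixes> section: one "word PRE" or "word SUF"
// entry per line. Blank lines, malformed lines, and unknown keywords are
// skipped in this sketch.
Affixes parse_affixes(const std::string &body) {
  Affixes a;
  std::istringstream in(body);
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string word, kind;
    if (!(fields >> word >> kind)) continue;
    if (kind == "PRE") a.prefixes.insert(word);
    else if (kind == "SUF") a.suffixes.insert(word);
  }
  return a;
}
```

A word declared on two lines, once with PRE and once with SUF, ends up in both sets, which mirrors the two-line declaration rule described above.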
<RE_NounAdj>, <RE_Closed> and <RE_DateNumPunct>
allow modifying the default regular
expressions for Part-of-Speech tags. These regular expressions are
used by the NER to determine whether a sentence-beginning word has
some tag that is Noun or Adj, or any tag that is a closed
category, or one of date/punctuation/number. The default is to
check against Eagles tags; thus, the recognizer will fail to
identify these categories if your dictionary uses another tagset,
unless you specify the right patterns to look for.
For instance, if our dictionary uses Penn-Treebank-like tags, we should define:
<RE_NounAdj>
^(NN$|NNS|JJ)
</RE_NounAdj>
<RE_Closed>
^(D|IN|C)
</RE_Closed>
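To see which Penn-Treebank-like tags these two patterns accept, they can be tried out with the standard C++ regex library. This is only a sketch: it assumes the patterns are applied as a search over the tag string with ECMAScript syntax, which may differ from the regex engine FreeLing uses internally, and the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Patterns copied from the example above.
const std::regex re_noun_adj("^(NN$|NNS|JJ)");
const std::regex re_closed("^(D|IN|C)");

// Hypothetical helpers: test a PoS tag against each pattern.
bool is_noun_or_adj(const std::string &tag) {
  return std::regex_search(tag, re_noun_adj);
}

bool is_closed_category(const std::string &tag) {
  return std::regex_search(tag, re_closed);
}
```

Under these assumptions, NN, NNS, JJ and JJR are accepted as noun or adjective, while NNP is not, because the NN alternative is anchored at the end of the tag. DT, IN and CC are accepted as closed categories.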
<TitleLimit>
contains only one line with an integer
value stating the length beyond which a sentence written entirely in uppercase will be considered a title and not a proper
noun. Example:
<TitleLimit>
3
</TitleLimit>
If TitleLimit=0
(the default), title detection is
deactivated (i.e., all-uppercase sentences are always marked as
named entities).
The idea of this heuristic is that newspaper titles are usually written in uppercase, and tend to have at least two or three words, while named entities written in this way tend to be acronyms (e.g. IBM, DARPA, ...) and usually have at most one or two words.
For instance, if TitleLimit=3
the sentence
FREELING ENTERS NASDAC UNDER CLOSE OBSERVATION OF MARKET ANALYSTS
will not be recognized as a named entity, and will have its words analyzed
independently. On the other hand, the sentence IBM INC., having fewer than
3 words, will be considered a proper noun.
Obviously this heuristic is not 100% accurate, but in some cases (e.g. if you are analyzing newspapers) it may be preferable to the default behaviour (which is not 100% accurate, either).
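The title heuristic can be sketched as follows. The function names are hypothetical and this is not FreeLing's implementation. In particular, the text does not say how a sentence of exactly TitleLimit words is treated, so counting it as a title is an assumption of this sketch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True if the token contains no lowercase ASCII letters.
bool has_no_lowercase(const std::string &w) {
  for (unsigned char c : w)
    if (std::islower(c)) return false;
  return true;
}

// Hypothetical check: should an all-uppercase sentence be treated as a title
// (and its words analyzed independently) rather than as a named entity?
// title_limit == 0 disables detection. Treating a sentence of exactly
// title_limit words as a title is an assumption.
bool is_title(const std::vector<std::string> &words, std::size_t title_limit) {
  if (title_limit == 0 || words.size() < title_limit) return false;
  for (const auto &w : words)
    if (!has_no_lowercase(w)) return false;
  return true;
}
```

With a limit of 3, the nine-word headline from the example is classified as a title, while IBM INC. is not. With a limit of 0 nothing is ever classified as a title.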
<SplitMultiwords>
contains only one line with
either yes or no. If SplitMultiwords is
activated, named entities will still be recognized, but they will
not be treated as a unit with only one Part-of-Speech tag for
the whole compound. Each word gets its own Part-of-Speech tag
instead: capitalized words get the tag specified in
NE_Tag, while the Part-of-Speech tags of
non-capitalized words inside a named entity (typically,
prepositions and articles) will be left untouched.
<SplitMultiwords>
no
</SplitMultiwords>
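The effect of activating SplitMultiwords on the tags of a recognized entity can be sketched as below. The Token struct and the tag_split_entity function are hypothetical stand-ins rather than FreeLing classes, and the sketch assumes that capitalized words receive the tag given in NE_Tag while other words keep the tag they already have.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for an analyzed word.
struct Token {
  std::string form;
  std::string tag;
};

// With SplitMultiwords activated, the words of a recognized entity stay
// separate: capitalized words receive ne_tag (an assumption of this sketch),
// and non-capitalized words keep their existing tag.
void tag_split_entity(std::vector<Token> &entity, const std::string &ne_tag) {
  for (auto &t : entity) {
    if (!t.form.empty() && std::isupper(static_cast<unsigned char>(t.form[0])))
      t.tag = ne_tag;
  }
}
```

Applied to the three words of Bank of England with NP00000 as the NE tag, the sketch retags Bank and England and leaves the tag of the preposition as it was.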
Lluís Padró 2013-09-09